Open slh1109 opened 2 months ago
My take:
Agree with @venkatzhub
IBM has released its Granite LLMs under the Apache 2.0 license, but the COBOL training data is fairly limited: only 727 COBOL programs, compared with 4M+ C++ programs. Further, there is currently no coverage of PL/I, HLASM, REXX, JCL, and other mainframe languages.
@venkatzhub I'm reading your email but will reply here to maintain the trail.
IBM references its Project CodeNet, with a detailed spreadsheet listing each language and its number of accepted submissions, and notes that the code was sourced from two Japanese coding challenge websites. It's overwhelmingly C++ and Python.
Thanks @markbsigler!
Project description
This project aims to collect a dataset of production COBOL and associated mainframe languages (JCL, REXX, PL/I) on which Large Language Models (LLMs) can be fine-tuned. It also aims to develop an evaluation suite to measure LLMs' ability to comprehend, explain, and write COBOL. This project will deliver the following:
Dataset
The dataset should be composed of high-quality, permissively licensed COBOL code. The code should be representative of production COBOL applications and should be cleaned of any personally identifiable information (PII).
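As a rough illustration of the two filtering requirements above (permissive licensing and PII removal), here is a minimal Python sketch. Everything in it is an assumption for discussion: the `SourceFile` record, the `PERMISSIVE_LICENSES` allowlist, and the `<EMAIL>` placeholder are hypothetical, and matching email addresses covers only one narrow kind of PII. A real pipeline would need much broader detection (names, account numbers, addresses) and proper license scanning.

```python
import re
from dataclasses import dataclass

# Hypothetical allowlist of SPDX identifiers treated as permissive.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}

# Deliberately simple email pattern; it will miss many other kinds of PII.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass
class SourceFile:
    """Hypothetical record for one candidate file in the dataset."""
    path: str
    license_id: str
    text: str


def is_permissive(source: SourceFile) -> bool:
    """Return True if the file's license is on the allowlist."""
    return source.license_id in PERMISSIVE_LICENSES


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def prepare(files: list[SourceFile]) -> list[SourceFile]:
    """Keep permissively licensed files and redact emails in their text."""
    return [
        SourceFile(f.path, f.license_id, redact_emails(f.text))
        for f in files
        if is_permissive(f)
    ]


if __name__ == "__main__":
    candidates = [
        SourceFile("payroll.cbl", "MIT", "      * AUTHOR: jane@example.com\n"),
        SourceFile("ledger.cbl", "Proprietary", "      * INTERNAL USE ONLY\n"),
    ]
    for kept in prepare(candidates):
        print(kept.path, repr(kept.text))
```

Filtering on license before redaction avoids spending effort on files that cannot be included anyway.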
Evaluation Suite
The evaluation suite should comprise a series of tasks that quantitatively measure an arbitrary LLM's ability to read and write COBOL. BloopAI's COBOLEval benchmark, a translation of the widely used, OpenAI-developed HumanEval LLM benchmark to COBOL, can be used as a foundation for the suite.
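To make "quantitatively measure" concrete: HumanEval results are commonly reported as pass@k, the probability that at least one of k sampled completions passes a problem's unit tests. The sketch below implements the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass. Whether this suite would adopt pass@k is an assumption here, and the function name `pass_at_k` is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total completions sampled for the problem
    c: number of those completions that passed all tests
    k: number of attempts the metric allows (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing completion.
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Three hypothetical problems, 10 samples each, with 0, 3, and 10 passes.
    print(mean_pass_at_k([(10, 0), (10, 3), (10, 10)], k=1))
```

Sampling n > k completions per problem and using this estimator gives a lower-variance score than drawing exactly k samples and checking whether any of them passes.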
Statement on alignment with Open Mainframe Project Mission and Vision statements
Enable the mainframe to be more consumable by developers with a transparent experience in leveraging the value propositions of the mainframe.
Are there similar/related projects out there?
None that we are aware of for mainframe languages. Software Heritage archives decommissioned software systems of all languages.
External dependencies (including licenses)
https://github.com/BloopAI/COBOLEval (MIT)
Sponsor from TAC
Joe Bostian
Proposed Project Stage
Sandbox
License and contribution guidelines
unknown
Current or desired source control repository
GitHub
External dependencies (including licenses)
none, tbd
Initial committers
tbd
Infrastructure requests
tbd
Communication channels
email, Google Docs, Zoom meetings
Website
none/tbd
Release methodology and mechanics
tbd
Social media accounts
none/tbd
Community size and any existing sponsorship
Initial team of around half a dozen: John Mertic jmertic@linuxfoundation.org; Ed Airey eairey@averisource.com; Elpida Tzortzatos elpida@us.ibm.com; Jim Porell jporell@rocketsoftware.com; Joseph Bostian jbostian@us.ibm.com; Leonard Santalucia lsantalucia@vicominfinity.com; Louis Knight-Webb louis@bloop.ai; Per Kroll per.kroll@broadcom.com; Venkatauday Balabhadrapatruni venkatauday.balabhadrapatruni@broadcom.com; Goran Begic goran.begic@broadcom.com; Gabriel Gordon-Hall gabriel@bloop.ai; Stephen Hodges shodges@averisource.com