A dataset for programming language identification.
Rules for sample inclusion are:
The dataset is stored in the `data` directory. It contains:

`meta.yml`
: metadata about the dataset and the available languages.

`dataset.yml`
: collection of all samples, with pointers to sample paths relative to `data`.

See REPORT.md for a summary of the dataset.
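Since sample paths in `dataset.yml` are relative to the `data` directory, they must be joined onto it before a sample file can be opened. A minimal sketch of that resolution step, assuming each parsed entry carries `language` and `path` fields (the actual YAML schema is not documented here):

```python
from pathlib import Path

DATA_DIR = Path("data")

# Hypothetical entries as they might look after parsing dataset.yml
# with a YAML library; the field names are assumptions.
entries = [
    {"language": "Python", "path": "Python/sample_1.py"},
    {"language": "Ruby", "path": "Ruby/sample_1.rb"},
]

def resolve(entry: dict) -> Path:
    # Paths in dataset.yml are relative to the data/ directory,
    # so join them onto DATA_DIR before reading the sample file.
    return DATA_DIR / entry["path"]

for entry in entries:
    print(entry["language"], resolve(entry))
```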
See CONTRIBUTING.md.
The `tools` directory contains various Python utilities to maintain the dataset:

`tools/gen_meta.py`
: Generates `data/meta.yml`. This is only needed when upgrading to a new github/linguist or acmeism/RosettaCodeData version.

`tools/harvest.py`
: Fetches samples from GitHub.

`tools/vote.py`
: Updates the `vote` annotation.

`tools/lint.py`
: Checks the dataset for potential problems.

`tools/prepare_commit.py`
: Updates generated files; required before any commit.

`tools/classify_linguist.py`
: Updates linguist labels.

`tools/classify_pygments.py`
: Updates pygments labels.

To run the tools, first create the virtual environment:
```shell
pip install poetry
poetry install
```
Then run a tool with `python -m`:

```shell
poetry run python -m tools.gen_meta
```
Each sample in `data` has its own license; check the origin repository for details.
Everything else is licensed under the MIT License.