smola / language-dataset

Dataset for programming language identification.
MIT License

language-dataset

A dataset for programming language identification.

Methodology

Rules for sample inclusion are:

Dataset

The dataset is stored in the data directory. It contains:

See REPORT.md for a summary of the dataset.
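
As a rough way to inspect the data directory locally, the sketch below counts files under each top-level subdirectory. It assumes samples are grouped into subdirectories of data (for example one per language), which this README does not spell out. count_samples is a hypothetical helper written for illustration and is not one of the utilities in the tools directory.

```python
from collections import Counter
from pathlib import Path


def count_samples(data_dir: Path) -> Counter:
    """Count regular files under each top-level subdirectory of data_dir.

    Assumption: each top-level subdirectory is one group of samples.
    Files placed directly in data_dir are ignored.
    """
    counts: Counter = Counter()
    for group in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        # Count every regular file below the group directory, recursively.
        counts[group.name] = sum(1 for f in group.rglob("*") if f.is_file())
    return counts


if __name__ == "__main__":
    root = Path("data")
    if root.is_dir():
        # Print groups from largest to smallest.
        for name, n in count_samples(root).most_common():
            print(f"{name}\t{n}")
    else:
        print("data directory not found; run this from the repository root")
```

Run it from the repository root so that the relative path data resolves. REPORT.md remains the reference summary; this only gives a quick local count.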

Contributing

See CONTRIBUTING.md.

Tooling

The tools directory contains various Python utilities to maintain the dataset:

To run the tools, first install Poetry and create the virtual environment:

pip install poetry
poetry install

Then run a tool as a module with python -m, for example:

poetry run python -m tools.gen_meta

License

Each sample in the data directory has its own license. Check the sample's origin repository for details.

Everything else is licensed under the MIT License.