src-d / datasets

source{d} datasets ("big code") for source code analysis and machine learning on source code

source{d} Datasets

source{d} datasets for source code analysis and machine learning on source code (ML on Code).

This repository contains the tools and scripts needed to reproduce the datasets, as well as the academic papers related to them.

Available datasets

Public Git Archive

Programming Language Identifiers

Code duplicates

Pull Request review comments

Commit messages

Structural commit features

DockerHub Metadata

DockerHub Packages

Typos

NuGet Namespaces

Contributions

Contributions are very welcome; please see CONTRIBUTING.md and the code of conduct.

License

The tools and scripts are licensed under Apache 2.0; see LICENSE.md.