
great

The dataset for the variable-misuse task, described in the ICLR 2020 paper 'Global Relational Models of Source Code' [https://openreview.net/forum?id=B1lnbRNtwr]

This is the public version of the dataset used in that paper. The original, used to produce the graphs in the paper, could not be open-sourced due to licensing issues. See the associated public code repository [https://github.com/VHellendoorn/ICLR20-Great] for results produced from this dataset.

This dataset was generated synthetically from the corpus of Python code in the ETH Py150 Open dataset [https://github.com/google-research-datasets/eth_py150_open].

The dataset is presented in three splits: the training split train, the validation split dev, and the evaluation (test) split eval. Each was derived from the corresponding split of ETH Py150 Open.

Each dataset split is stored as a sharded text file. Each shard is named <split>__VARIABLE_MISUSE__SStuB.txt-<shard number>-of-<number of shards>. Each shard is a regular text file in which every line contains one JSON-encoded example; reading a shard line by line and decoding each line as JSON reconstitutes the examples.

We chose the number of shards per split to ensure that no individual file exceeds the GitHub-imposed 100 MB per-file limit.

Shards of each split are placed in separate subdirectories.
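Given the naming scheme and JSON-lines format described above, a split can be loaded with a few lines of standard-library Python. This is an illustrative sketch, not part of the released tooling; the load_split helper and its arguments are hypothetical, and it assumes the shard files for a split sit together in one directory:

```python
import glob
import json
import os


def load_split(split_dir, split_name):
    """Yield one decoded example per line, across all shards of a split.

    Hypothetical helper: assumes shards are named
    <split>__VARIABLE_MISUSE__SStuB.txt-<shard>-of-<total>, as described
    above, and that they all live in split_dir.
    """
    pattern = os.path.join(
        split_dir, f"{split_name}__VARIABLE_MISUSE__SStuB.txt-*-of-*"
    )
    # Sort so examples are yielded in a stable shard order.
    for shard_path in sorted(glob.glob(pattern)):
        with open(shard_path, encoding="utf-8") as shard:
            for line in shard:
                # Each line is a self-contained JSON-encoded example.
                yield json.loads(line)
```

Because the helper is a generator, it streams examples without loading an entire shard into memory, which matters for shards close to the 100 MB limit.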

Each example has the following fields:

Each example is released under the license of its originating GitHub repository and project, as recorded in its license field. The dataset therefore comprises individual files licensed under differing terms; please take appropriate care when using it.

Example code that uses the data is available in a separate repository [https://github.com/VHellendoorn/ICLR20-Great].