Data available - Githubissues

The data is temporarily available in a Google Drive folder, so it's time to move on to the modeling!

The link points to a folder in which two files are of interest: database.csv and files.zip. Some details:

database.csv contains one row per aligned diff. It includes a wide range of possible bug-fixes (see the column headers). Alignment is guaranteed by the extractor, all of these commits were tagged as a fix, and I heuristically filtered out commits that seem to add/remove only comments or a string value. We will initially focus on the simplest case, which has a "1" in each of the columns files_changed, java_files_changed, file_diffs, lines_removed, and lines_added (i.e. commits changing only a single Java file in one place, removing one line and adding one line). I found 13,863 of these here.
files.zip contains two folders, files-pre and files-post, with the respective pre- and post-commit versions of the files in database.csv. These files are already tokenized (tab- and newline-separated). To build the corpus, just traverse database.csv, reading in the corresponding pre and post files and extracting the corresponding lines. Note that line indices in database.csv are 1-based. For the first step (the simplest case above), that means you will get one aligned pair per valid commit: a single line before and a single line after (~13K pairs).
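The traversal described above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the five filter columns are named as in the description, but the column names `file_name`, `pre_line` and `post_line` are hypothetical placeholders (check the actual header of database.csv), and it assumes the filter columns hold the literal string "1" and that files.zip has been unpacked so files-pre and files-post are plain directories.

```python
import csv
from pathlib import Path

# Columns named in the description; a row is "simple" when all of them are "1".
SIMPLE_COLUMNS = ("files_changed", "java_files_changed", "file_diffs",
                  "lines_removed", "lines_added")


def is_simple_fix(row):
    """True for commits that change one Java file in one place, one line each way."""
    return all(row.get(col, "").strip() == "1" for col in SIMPLE_COLUMNS)


def read_line(path, line_number):
    """Return the tokens on the 1-based line_number of a tokenized file."""
    lines = Path(path).read_text(encoding="utf-8").split("\n")
    return lines[line_number - 1].split("\t")  # convert 1-based to 0-based


def build_corpus(database_csv, pre_dir, post_dir,
                 file_col="file_name",       # hypothetical column name
                 pre_line_col="pre_line",    # hypothetical column name
                 post_line_col="post_line"):  # hypothetical column name
    """Collect (before_tokens, after_tokens) pairs for the simplest case."""
    pairs = []
    with open(database_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if not is_simple_fix(row):
                continue
            name = row[file_col]
            before = read_line(Path(pre_dir) / name, int(row[pre_line_col]))
            after = read_line(Path(post_dir) / name, int(row[post_line_col]))
            pairs.append((before, after))
    return pairs
```

Calling something like `build_corpus("database.csv", "files-pre", "files-post")` would then give a list of token-list pairs, one per matching row. Relaxing the constraints later should mostly mean changing `is_simple_fix`.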

After extracting your corpus and splitting the tokens, you should be able to build a representative vocabulary (and optionally learn a BPE encoding on that vocabulary) and then train any typical seq2seq model. Later on, we can relax these constraints to also include, e.g., commits in which more than one file was changed but only one of them is a Java file, as well as multiple aligned pairs and/or pairs with multiple lines.
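For the vocabulary step, a minimal sketch might look like the one below. It assumes the corpus is a list of (before_tokens, after_tokens) pairs and uses a single shared vocabulary for both sides; the special tokens, the `min_count` cutoff and the function names are illustrative choices, not something fixed by the data. BPE is not covered here.

```python
from collections import Counter


def build_vocab(pairs, min_count=1,
                specials=("<pad>", "<unk>", "<s>", "</s>")):
    """Map tokens to integer ids, most frequent first, after the special tokens."""
    counts = Counter(tok for before, after in pairs
                     for tok in (*before, *after))
    vocab = {tok: idx for idx, tok in enumerate(specials)}
    # Sort by descending count, then by token, so the ids are deterministic.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token with its id, falling back to the <unk> id."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]
```

Raising `min_count` is one simple way to keep rare identifiers and literals out of the vocabulary; those tokens then map to `<unk>` when encoding.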

hsezhiyan / CodePatching289G

Data available #11