Open VHellendoorn opened 5 years ago
Quick follow-up: given a diff line, the corresponding file name (in files-pre or files-post, respectively) can be produced as:
'files-pre/' + org + '#' + project + '#' + commit_id + '#' + file.replace("/", "__") + '#pre.java'
(replace both occurrences of 'pre' with 'post' as needed)
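As a small sketch, the same rule can be wrapped in a helper that handles both versions. The function name `diff_file_path` and its `version` parameter are made up for illustration; only the naming scheme itself comes from the expression above.

```python
def diff_file_path(org: str, project: str, commit_id: str, file: str,
                   version: str = "pre") -> str:
    """Build the relative path of the tokenized file for one diff line.

    `version` is a hypothetical parameter selecting the pre- or post-commit copy.
    """
    if version not in ("pre", "post"):
        raise ValueError("version must be 'pre' or 'post'")
    # Slashes in the original path are flattened to double underscores.
    flat_name = file.replace("/", "__")
    return f"files-{version}/{org}#{project}#{commit_id}#{flat_name}#{version}.java"
```

Calling it with `version="post"` swaps both occurrences of 'pre' at once, which avoids changing only one of them by mistake.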
The data is temporarily available in a Google Drive folder, so it's time to move on to the modeling!
The link points to a folder in which two files are of interest: database.csv and files.zip. Some details:
files.zip contains two directories, files-pre and files-post, with the respective pre- and post-commit versions of the files in database.csv. These files are already tokenized (tab- and newline-separated). To build the corpus, traverse database.csv, read in the corresponding pre and post files, and extract the corresponding lines. Note that line indices in database.csv are 1-based. For the first step (see above), that means you will get an aligned pair of a single line before and after each valid commit (~13K).

After extracting your corpus and splitting the tokens, you should be able to build a representative vocabulary (and optionally learn a BPE encoding on that vocabulary) and then train any typical seq2seq model. Later on, we can relax these constraints to also include, e.g., commits in which more than one file was changed but only one of them is a Java file, as well as multiple aligned pairs and/or pairs with multiple lines.
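The extraction could look roughly like the sketch below. The column names (`org`, `project`, `commit_id`, `file`, `line_pre`, `line_post`) are assumptions made for illustration, since the actual header of database.csv is not listed here, and the helper names are hypothetical too. It also assumes files.zip has been unpacked under a single root directory.

```python
import csv
import os
from collections import Counter


def tokenized_path(root, row, version):
    """Path of the pre- or post-commit tokenized file for one CSV row."""
    flat = row["file"].replace("/", "__")  # assumed column name: file
    name = f"{row['org']}#{row['project']}#{row['commit_id']}#{flat}#{version}.java"
    return os.path.join(root, f"files-{version}", name)


def read_line_tokens(path, line_no):
    """Return the tokens on the 1-based line `line_no` of a tokenized file."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().split("\n")
    # database.csv uses 1-based line indices, Python lists are 0-based.
    return lines[line_no - 1].split("\t")


def extract_pairs(database_csv, root):
    """Collect one aligned (pre, post) token-list pair per row of database.csv."""
    pairs = []
    with open(database_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # line_pre / line_post are assumed column names.
            pre = read_line_tokens(tokenized_path(root, row, "pre"), int(row["line_pre"]))
            post = read_line_tokens(tokenized_path(root, row, "post"), int(row["line_post"]))
            pairs.append((pre, post))
    return pairs


def build_vocab(pairs):
    """Count token frequencies over both sides of every pair."""
    counts = Counter()
    for pre, post in pairs:
        counts.update(pre)
        counts.update(post)
    return counts
```

The resulting counts can be thresholded into a vocabulary or handed to a BPE learner before training the seq2seq model.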