hsezhiyan / CodePatching289G

Transformer Models for Code Patching

Data available #11

Open VHellendoorn opened 5 years ago

VHellendoorn commented 5 years ago

The data is temporarily available in a Google Drive folder, so it's time to move on to the modeling!

The link points to a folder in which two files are of interest: database.csv and files.zip. Some details:

After extracting your corpus and splitting the tokens, you should be able to build a representative vocabulary (optionally learning a BPE encoding on that vocabulary) and then train any typical seq2seq model. Later on, we can relax these constraints to also include, for example, commits that changed more than one file but only one Java file, as well as commits with multiple aligned pairs and/or pairs spanning multiple lines.
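For the vocabulary step, here is a minimal sketch in Python, assuming each diff line has already been split into a list of string tokens. The special-token names, the `min_count` threshold, and the helper names `build_vocab` and `encode` are illustrative assumptions, not something the dataset prescribes.

```python
from collections import Counter

# Hypothetical special tokens; a seq2seq toolkit may use different names.
SPECIALS = ["<pad>", "<unk>", "<s>", "</s>"]


def build_vocab(token_lines, min_count=2):
    """Map each sufficiently frequent token to an integer id.

    token_lines: iterable of token lists (one list per source line).
    min_count:   assumed frequency cutoff; rarer tokens map to <unk>.
    """
    counts = Counter(tok for line in token_lines for tok in line)
    # Most frequent first, ties broken alphabetically for determinism.
    kept = sorted(
        (tok for tok, c in counts.items() if c >= min_count),
        key=lambda tok: (-counts[tok], tok),
    )
    return {tok: idx for idx, tok in enumerate(SPECIALS + kept)}


def encode(tokens, vocab):
    """Turn one token list into ids, wrapped in start/end markers."""
    unk = vocab["<unk>"]
    return [vocab["<s>"]] + [vocab.get(t, unk) for t in tokens] + [vocab["</s>"]]


if __name__ == "__main__":
    lines = [["int", "x", "=", "0", ";"], ["int", "y", "=", "1", ";"]]
    vocab = build_vocab(lines, min_count=2)
    print(encode(["int", "z", "=", "0", ";"], vocab))
```

If you go the BPE route instead, the frequency cutoff becomes less important, since rare identifiers are broken into subword units rather than collapsed into `<unk>`.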

VHellendoorn commented 5 years ago

Quick follow-up: given a diff line, the corresponding file name (in files-pre or files-post, respectively) can be produced as: 'files-pre/' + org + '#' + project + '#' + commit_id + '#' + file.replace("/", "__") + '#pre.java' (replace both occurrences of 'pre' with 'post' as needed).
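The same expression wrapped in a small helper, as a sketch: the function name `snapshot_path` and the `side` parameter are my own, and the values in the usage example are made up for illustration. How you pull org, project, commit_id and file out of database.csv depends on its columns, which this snippet does not assume.

```python
def snapshot_path(org, project, commit_id, file, side="pre"):
    """Build the relative path of the pre- or post-commit copy of a file.

    side must be "pre" or "post"; it is substituted in both places
    (the folder name and the file suffix).
    """
    if side not in ("pre", "post"):
        raise ValueError("side must be 'pre' or 'post'")
    # Slashes in the original path are flattened to double underscores.
    flat = file.replace("/", "__")
    return "files-" + side + "/" + org + "#" + project + "#" + commit_id + "#" + flat + "#" + side + ".java"


if __name__ == "__main__":
    # Made-up values, only to show the shape of the result.
    print(snapshot_path("someorg", "someproject", "abc123", "src/main/Foo.java"))
    print(snapshot_path("someorg", "someproject", "abc123", "src/main/Foo.java", side="post"))
```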