hsezhiyan / CodePatching289G

Transformer Models for Code Patching

Steps Moving Forward #12

Open hsezhiyan opened 5 years ago

hsezhiyan commented 5 years ago

Plan of action:

1) One-line-to-one-line changes. Train on the large corpus, with an 80% training and 20% testing split. Explore various hyperparameters, and then use BPE. Train for 60,000 steps before trying a new configuration.
2) Many-line-to-one-line changes. Experiment with various levels of context using the best model parameters from step 1.
3) Possibly method/class level granularity.
4) Possibly work on a tool inside a common IDE.
5) Possibly use beam search/the Rocchio algorithm to provide the user with various options.
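For step 1, the 80/20 split could be made deterministic at the project level so that no project's samples leak between sides. A minimal sketch; the `project`, `bug`, and `fix` field names are hypothetical, not from our extraction code:

```python
import hashlib

def split_by_project(samples, test_fraction=0.2):
    """Deterministically assign each project's samples to train or test.

    Hashing the project name keeps all samples from one project on the
    same side of the split, so the test set is strictly held out.
    """
    train, test = [], []
    for sample in samples:
        digest = hashlib.md5(sample["project"].encode()).hexdigest()
        bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
        (test if bucket < test_fraction * 100 else train).append(sample)
    return train, test

# Toy corpus: 1000 samples from 1000 distinct projects.
samples = [{"project": f"proj{i}", "bug": "a", "fix": "b"} for i in range(1000)]
train, test = split_by_project(samples)
```

Because the assignment is a pure function of the project name, rerunning the split (e.g. after re-extracting the corpus) keeps every project on the same side.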

VHellendoorn commented 5 years ago

Some comments:

hsezhiyan commented 5 years ago

Hi Vincent,

I ran the model for the base case (no BPE, direct translation). However, it seems to be seriously overfitting.

(Screenshots, 2019-05-18: training loss curve and validation loss curve.)

The first image is the sequence loss (from the NLTK library) on the training set, and the second is on the test set. We used the following hyperparameters:

- batch_size: 128
- num_layers: 4
- model_dim: 512
- num_heads: 8
- linear_key_dim: 64
- linear_value_dim: 64
- ffn_dim: 1024
- max_seq_length: 15

The max_seq_length limits the output of the decoder to 15 characters, which seemed reasonable. For the other parameters, we tried to emulate the original Transformer as closely as possible.

Should we proceed with BPE? Perhaps BPE will help solve the overfitting issue.
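As a reminder of what a BPE pass would do to our vocabulary, here is a minimal, illustrative sketch of the merge loop over a toy word-frequency table; in practice a library implementation would be a better fit than this:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges over a corpus given as a word -> frequency dict.

    Each word is split into single-character symbols; at every step the
    most frequent adjacent symbol pair is merged into one new symbol.
    """
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] = freq
        vocab = merged_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=3)
# First merge is ('l', 'o'): the pair appears in every word.
```

The appeal for code is that identifiers unseen at training time decompose into known subword units instead of becoming UNK, which may also reduce overfitting to rare tokens.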

VHellendoorn commented 5 years ago

Hi Hari,

Let's dig a bit deeper into the details.

  1. First, is this a train/test split across different projects (and/or organizations)?
  2. Did you run a validation pass every epoch as well, maybe on a random 5-10% of the training data (meaning that set is not strictly separated by project), and if so, how did that turn out?
  3. Also, how many epochs did you run it for? The training set was quite small, so 10K minibatches might just be multiple epochs? (also, which implementation is this?)
  4. You mention 15 characters, which would be quite a bit less than many lines. Did you mean words? If so, what is the vocabulary? Are any words treated as UNK?
  5. Might be good to measure prediction accuracy (of the full prediction) besides BLEU score, since ultimately we care about getting the "correct" fix. Also, it might be worthwhile to delete any samples that were identical (bug and fix) from the training data; I didn't add this filter in my extraction, but this could occur if e.g. only the content of a string changed.
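Point 5 above (exact-match accuracy plus filtering out identical bug/fix pairs) could be sketched as follows; the `bug`/`fix` field names are placeholders for whatever the extraction emits:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predicted fixes that exactly match the reference fix."""
    assert len(predictions) == len(references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)

def drop_identity_pairs(samples):
    """Remove samples whose buggy line is identical to the fixed line."""
    return [s for s in samples if s["bug"] != s["fix"]]

# e.g. one of two predictions matches the reference -> accuracy 0.5
acc = exact_match_accuracy(["return x;", "return y;"],
                           ["return x;", "return z;"])
```

Unlike BLEU, exact match gives no partial credit, which matches the end goal: a patch is only useful if it is the correct fix.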
hsezhiyan commented 5 years ago

Hi Vincent,

Apologies for the delayed response.

1) Yes. Moreover, the model shuffles the data before training to ensure proper randomization.
2) Yes, a validation pass is done every 1,000 steps. The validation loss is the curve presented in the second graph (the loss does not decrease).
3) The code we used does not count epochs, only steps. We ran for 60k steps (which we found to be standard across various academic papers).
4) Sorry, I meant 15 tokens. We did not treat any words as UNK; perhaps using UNK and the other vocabulary-reducing techniques described in other papers would be a better option.
5) We have not measured prediction accuracy yet. I will work on that tonight and tomorrow.
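Regarding point 4, the usual vocabulary-reduction scheme replaces tokens below a frequency threshold with an `<UNK>` symbol. A minimal sketch (the threshold and symbol name are illustrative choices, not settings from the implementation we are using):

```python
from collections import Counter

def build_vocab(token_lines, min_count=2):
    """Keep only tokens that occur at least min_count times."""
    counts = Counter(tok for line in token_lines for tok in line)
    return {tok for tok, c in counts.items() if c >= min_count}

def apply_unk(line, vocab, unk="<UNK>"):
    """Replace out-of-vocabulary tokens with the UNK symbol."""
    return [tok if tok in vocab else unk for tok in line]

lines = [["int", "x", "=", "0", ";"], ["int", "y", "=", "1", ";"]]
vocab = build_vocab(lines)
# "x", "y", "0", and "1" each appear once and fall below min_count
```

BPE would likely be preferable for code, since rare identifiers carry most of the signal in a fix and would otherwise all collapse to `<UNK>`.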

If possible, can we meet for a bit after tomorrow? We used the following implementation (you suggested it for ease of use):

https://github.com/DongjunLee/transformer-tensorflow

Sincerely, Hari