hsezhiyan / CodePatching289G

Transformer Models for Code Patching

Updated Repo #13


hsezhiyan commented 5 years ago

Hello Vincent,

We just combined the data processing and model repos into one. We'll send you a few more updates later today.

hsezhiyan commented 5 years ago

Hi Vincent,

We tried the following approaches:

1) BPE using SentencePiece repo (https://github.com/google/sentencepiece). Loss curve below:

[Plot: loss curve, BPE with SentencePiece]

2) BPE using SentencePiece, additionally replacing all numerical values with the token NUM. This reduces the vocabulary size by about another 5k (a rough sketch of this replacement step is shown after the plot). Loss curve below:

[Plot: loss curve, BPE with NUM replacement]
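
For reference, a simplified sketch of the replacement step in (2), assuming whitespace-tokenized lines in plain-text files (the file names and the regex here are illustrative, not our exact pipeline):

```python
import re

# Heuristic: treat any standalone integer or float literal as a number.
# A real pipeline might use a lexer instead of this regex.
NUM_RE = re.compile(r'^\d+(\.\d+)?$')

def replace_numbers(line):
    """Replace every numeric token in a whitespace-tokenized line with NUM."""
    return ' '.join('NUM' if NUM_RE.match(tok) else tok for tok in line.split())

with open('train.src.txt') as fin, open('train.src.num.txt', 'w') as fout:
    for line in fin:
        fout.write(replace_numbers(line) + '\n')
```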

How do you think we can proceed?

VHellendoorn commented 5 years ago

Hi, thanks for the info. Can you please share a few more details first, or come by the lab today and show me?

hsezhiyan commented 5 years ago

Hi Vincent,

I apologize. I couldn't make it to the lab today because of a few classes. Would you be free to have a video call anytime today or tomorrow? I can meet whenever you're free.

Some more details:

1) The vocabulary size after this is ~41,000. Currently we have only replaced numbers with NUM, but we can do more if you think that's helpful.

2) Given the batch size is 128 and there are about 60,000 data lines, that's ~470 steps per epoch. Some papers I looked at on this topic trained in excess of 60k epochs, but I believe you said training for more than 10 epochs is unnecessary.

3) The average sequence is about 12 tokens, and we limit the output to 15 tokens. None (to my knowledge) were identical, but some were very close.

4) Yes; from what I understand of the repo, it trains on the training set and validates with the test set, so the validation losses come from the test set. In deep learning, is the validation loss always taken from the test set?

5) Yes; the BLEU score increases a bit and then drops off. Here is a plot:

[Plot: BLEU score over training]

VHellendoorn commented 5 years ago

Thanks for the details. The BLEU score actually looks pretty good; it appears to peak at about 40%. A few more clarification questions: I recall there were ~13K 1-1 line pairs, so if you say 60K lines, I assume you added some more (e.g. 1-3 to 1-3 line pairs)? Furthermore, based on 60K lines and 12 tokens/line, your total training data should be about 720K tokens, which at a minibatch size of 128 implies 5.6K steps per epoch. Is that right?

If these numbers are roughly correct, it's probably wise to reduce the vocabulary size: either keep only tokens that occur e.g. at least 5 times in the training data, or pre-process the data with BPE (down to e.g. 2K tokens), or even split heuristically on camelCase and under_scores (a rough sketch of such a splitter is below). Note that you can run this completely outside of the training loop; just build a separate pre-processed dataset. This is probably pretty important, because a 40K vocabulary on less than 1M tokens suggests a ton of identifier diversity that might make it very hard for the model to learn to translate, especially across projects. On this note, didn't you already use BPE, and if so, with what vocabulary size, and how did that training curve differ from the previous one? Is that a different result from the BLEU curve above?
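
For the heuristic option, something along these lines would do (a rough sketch; the regex and the treatment of ALLCAPS runs are just one possible choice):

```python
import re

def split_identifier(token):
    """Split a token on under_scores and camelCase boundaries."""
    subtokens = []
    for part in token.split('_'):
        if not part:
            continue
        # Insert a boundary before an uppercase letter that follows a
        # lowercase letter or digit, e.g. getFooBar -> get Foo Bar.
        subtokens.extend(re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', part).split())
    return subtokens

print(split_identifier('parseHTTPResponse_fast'))  # ['parse', 'HTTPResponse', 'fast']
```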

Regarding the validation data, I just want to emphasize that it should come out of the training data, not the test data. Specifically, the test set can be a set of held-out projects, while the validation set should be a random subset of held-out lines from the training data (so they may come from a project that is also in the training data). I'd like to see the curves for both, because together they capture how hard generalization is here, by contrasting intra-project (validation loss/accuracy) with cross-project (test loss/accuracy) performance.
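
Concretely, a sketch of the split I have in mind, assuming each example carries the project it came from (the names here are just illustrative):

```python
import random

def split_data(examples, test_projects, val_fraction=0.05, seed=42):
    """examples: list of (project, source_line, target_line) tuples.

    test       = all examples from held-out projects (cross-project)
    validation = random held-out lines from the remaining projects (intra-project)
    train      = everything else
    """
    rng = random.Random(seed)
    test = [ex for ex in examples if ex[0] in test_projects]
    rest = [ex for ex in examples if ex[0] not in test_projects]
    rng.shuffle(rest)
    n_val = int(len(rest) * val_fraction)
    return rest[n_val:], rest[:n_val], test
```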

Finally, it's probably best to check several random samples manually: actually look up the commit and make sure that the translation task does capture the diff.

hsezhiyan commented 5 years ago

Hi Vincent,

Thanks for the detailed response. A few clarifications:

1) The BLEU score is scaled to a percentage, so we're getting a 0.4% score, not 40%.

2) Apologies, I meant to say we used 13k lines for training; we simply used all 60k lines to train our BPE model. So the correct numbers are: 13k lines at about 12 tokens/line, 256k tokens, and with a minibatch of 128 that comes to about 2k steps per epoch. Regardless of the change in numbers, I believe we should still use heuristics to reduce the vocabulary. Do you agree? With our trained BPE, we're able to reduce the vocabulary size to 38k.

3) I understand what you mean by the validation set now. I will see if I can implement that in the existing codebase and will send you the plots from it when I have them.

VHellendoorn commented 5 years ago

Thanks for clarifying. One thing is still unclear: what is your current vocabulary size? And how large is it if you raise the vocabulary cut-off to e.g. 2, 3, or 5 (i.e. how many of these words occur fewer than two, three, or five times)? From the numbers I have seen, this seems like the single biggest obstacle to training well.
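
Counting this is quick; something like the following would answer it, assuming one tokenized example per line (the file name is a placeholder):

```python
from collections import Counter

counts = Counter()
with open('train.src.txt') as f:
    for line in f:
        counts.update(line.split())  # splits on tabs and spaces alike

print('distinct tokens:', len(counts))
for cutoff in (2, 3, 5):
    kept = sum(1 for c in counts.values() if c >= cutoff)
    print(f'tokens occurring at least {cutoff} times: {kept}')
```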

I think part of the confusion may lie in what a vocabulary is; a vocabulary counts the number of distinct tokens that you will treat as "seen" during training and testing. Typically, if your training data is 256K tokens, I would expect many of those to be the same token (e.g. i, int, String), and many rare tokens to occur just once or twice. As a result, you would typically end up with a vocabulary of about 5-25K distinct tokens on such a corpus, depending on how much variation there is in your data.

When using a technique like BPE, you specify how many word pieces you want there to be; typically you choose values like 2K or 5K. It then fuses characters together until it reaches that limit and stops. That means you will never get a vocabulary of 38K out of BPE (unless you manually set it to produce one of that size). And more broadly, if the vocabulary really is ~40K, I wouldn't expect my deep learner to do well, because that is a tremendous amount of linguistic diversity -- most of the words in the test data must be entirely new.

So let's get a handle on this. Please send me the vocabulary file containing all distinct words in your corpus and their count (frequency) in your training data. Then, train a BPE model on your data with a vocabulary cut-off of 2,000 (make sure your model respects tab-separated tokens!) and send me the resulting vocabulary as well. Once we are confident the model is only trying to capture 2K tokens (and feel free to print out the shape of the embedding/softmax layers to confirm that this is really happening), let's run it again.
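
For what it's worth, here is a sketch of that BPE step with SentencePiece (file names are placeholders; converting tabs to single spaces first is one simple way to keep BPE from merging across your original token boundaries, since by default it does not merge across whitespace):

```python
import sentencepiece as spm

# Tabs -> single spaces so every original token sits on a whitespace
# boundary that BPE will not merge across.
with open('train.src.txt') as fin, open('bpe_input.txt', 'w') as fout:
    for line in fin:
        fout.write(' '.join(line.split()) + '\n')

spm.SentencePieceTrainer.train(
    input='bpe_input.txt',
    model_prefix='code_bpe_2k',
    vocab_size=2000,
    model_type='bpe',
)

sp = spm.SentencePieceProcessor(model_file='code_bpe_2k.model')
print(sp.encode('int i = 0 ;', out_type=str))  # inspect the resulting pieces
```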

hsezhiyan commented 5 years ago

Hi Vincent,

Apologies for the delayed response.

I understand what you mean by vocabulary size. I will use BPE to restrict the vocabulary to 2K and 5K and try out the model with these.

As for the vocabulary file, you can find it in this repo at: ECS289_Transformer/transformer_tensorflow/data/processed-large-java/source_vocab

This vocab file currently contains about 41k entries.

We are running the BPE model with a vocab size of 2K now and will send you the new vocabulary file (with 2k vocab size) when the BPE model is done.

Thank you, Hari

FernandoPieressa commented 5 years ago

@VHellendoorn

I created both the vocabulary file containing all distinct words and their frequencies in the training data, and the BPE vocabulary (cut off at 4,500 unique tokens) with their frequencies.

I uploaded them to a folder called "Vocabulary" in the transformer folder so you can check them.

VHellendoorn commented 5 years ago

Thanks! Looking forward to the model results with BPE preprocessing. One small note: many of the BPE tokens start with something like __. Is that a placeholder for a tab? If so, it might be good to make sure that your BPE implementation is splitting on tabs, not spaces!
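
One quick sanity check (the model path is a placeholder): encode a tab-separated line and look at the pieces. SentencePiece marks the start of each whitespace-delimited word with the '▁' meta symbol, which some viewers render like an underscore, so that may be what the __ is.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='code_bpe_2k.model')
line = 'public\tstatic\tvoid\tmain'  # a tab-separated example line
print(sp.encode(line, out_type=str))
# If only the very first piece carries the '▁' marker, the whole line is being
# treated as one word, i.e. tabs are not acting as token boundaries.
```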

hsezhiyan commented 5 years ago

Hi Vincent,

We tried running a much smaller model (batch size 64, model dimension 30) and got better results, though they are still not very good. Attaching them here:

[Plot: BLEU score]

[Plot: test loss]

[Plot: training loss]

The training loss goes down (as expected), but the test loss doesn't. We also noticed the model fails to produce even remotely similar tokens (its output is often essentially random). So we are trying the following: first train the model to predict its own input, use this as pre-training, and then fine-tune the pre-trained model on the actual code patching task. That way the model at least learns to produce similar tokens before attempting the much harder translation task.
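
A simplified sketch of one way to build that pre-training set (file names are placeholders): pair every source line with itself, pre-train on these identity pairs, then fine-tune on the real before/after pairs.

```python
# Build an identity ("copy") pre-training corpus: target == source.
with open('train.src.txt') as fin, \
     open('pretrain.src.txt', 'w') as fsrc, \
     open('pretrain.tgt.txt', 'w') as ftgt:
    for line in fin:
        fsrc.write(line)
        ftgt.write(line)
```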

VHellendoorn commented 5 years ago

Hi there, yeah, I can't say for sure what causes this. I still suspect it's the data; I ran my own transformer on this and, although it doesn't necessarily get a very good test loss, it certainly decreases steadily through the epochs with the standard parameters, and the produced outputs are not crazy. It ends up at a per-BPE-subtoken entropy of around 2.4 bits. I don't have a BLEU score built in, so I can only say that the average per-subtoken accuracy was around 60%, so the translations should be somewhat alright. Training converged very quickly (basically within 1-3 epochs) and then just kind of hovered there, which makes sense because it's a very small dataset.

Since I didn't see the same test-loss behavior, I'm inclined to conclude that it's almost certainly an issue with your data processing pipeline, though I can't say for sure what the issue is. Perhaps there is some kind of discrepancy between the train and test data.