hsezhiyan opened 5 years ago
Hi Vincent,
I ran the model for the base case (no BPE, direct translation). However, it seems to be seriously overfitting.
The first image shows the sequence loss (computed with the NLTK library) on the training set, and the second shows it on the test set. We used the following hyperparameters:
batch_size: 128
num_layers: 4
model_dim: 512
num_heads: 8
linear_key_dim: 64
linear_value_dim: 64
ffn_dim: 1024
max_seq_length: 15
The max_seq_length limits the output of the decoder to 15 characters, which seemed reasonable. For the other parameters, we tried to match the original Transformer as closely as possible.
Should we proceed with BPE? Perhaps BPE will help solve the overfitting issue.
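For context on what BPE would do here: it repeatedly merges the most frequent adjacent symbol pair in the corpus, so rare words decompose into reusable subwords instead of blowing up the vocabulary. A minimal sketch (the `learn_bpe` function and the toy corpus are illustrative, not taken from the repo we're using):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words.

    Each word starts as a tuple of characters; the most frequent
    adjacent pair is merged repeatedly, num_merges times.
    Returns the ordered merge rules and the final subword vocab.
    """
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the vocab.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab
```

On a toy corpus like `["low"] * 5 + ["lower"] * 2 + ["newest"] * 6`, the first learned merge is `("w", "e")`, since it is the most frequent adjacent pair.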
Hi Hari,
Let's dig a bit deeper into the details.
Hi Vincent,
Apologies for the delayed response.
1) Yes. Moreover, the model shuffles the data before training to ensure proper randomization.
2) Yes, a validation pass is run every 1,000 steps. The validation loss is the curve presented in the second graph (the loss does not decrease).
3) The code we used does not count epochs, only steps. We ran for 60k steps (which we found to be standard across various academic papers).
4) Sorry, I meant 15 tokens. We did not map any words to UNK, thinking the full vocabulary might be a better option. We can use UNK and the other vocabulary-reduction techniques described in other papers.
5) We have not measured prediction accuracy yet. I will work on that tonight and tomorrow.
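For point 5, token-level prediction accuracy over non-padding positions is probably the simplest metric to start with. A sketch (pure Python; the assumption that padding uses id 0 is mine, not from the repo):

```python
def token_accuracy(predictions, targets, pad_id=0):
    """Fraction of non-padding target tokens predicted exactly.

    predictions/targets: equal-shaped lists of token-id sequences.
    Positions where the target equals pad_id are excluded
    (assumption: the preprocessing pads sequences with id 0).
    """
    correct = total = 0
    for pred_seq, tgt_seq in zip(predictions, targets):
        for p, t in zip(pred_seq, tgt_seq):
            if t == pad_id:
                continue  # skip padding positions
            total += 1
            correct += (p == t)
    return correct / total if total else 0.0
```

For example, with two sequences of which one token out of five real tokens is wrong, this returns 0.8.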
If possible, can we meet for a bit after tomorrow? We used the following implementation (you suggested it for ease of use):
https://github.com/DongjunLee/transformer-tensorflow
Sincerely, Hari
Plan of action:
1) One-line-to-one-line changes. Train on the large corpus with an 80/20 train/test split. Explore various hyperparameters, then use BPE. Train for 60,000 steps before trying a new configuration.
2) Many-lines-to-one-line changes. Experiment with various levels of context using the best model parameters from step 1.
3) Possibly method/class-level granularity.
4) Possibly build a tool inside a common IDE.
5) Possibly use beam search or the Rocchio algorithm to present the user with multiple options.
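On point 5, beam search is what would let us surface the top-k candidate translations instead of a single greedy one. A generic sketch, decoupled from any model (the `step_logprobs` callback and the toy distribution below are hypothetical, not from the transformer-tensorflow code):

```python
import math

def beam_search(step_logprobs, beam_size=3, max_len=15, eos=None):
    """Generic beam search over a next-token scoring callback.

    step_logprobs(prefix) -> {token: log_prob} for the next token.
    Keeps the beam_size highest-scoring sequences at each step and
    returns them with their cumulative log-probabilities.
    """
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if eos is not None and prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished beam carries over
                continue
            for tok, lp in step_logprobs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams
```

With a toy scorer that always assigns probability 0.6 to "a" and 0.4 to "b", a two-step search ranks ("a", "a") first with log-probability log(0.36), which matches the exhaustive answer.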