google-deepmind / mathematics_dataset

This dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty.
Apache License 2.0

Questions about the training settings #5

Open qwertier24 opened 5 years ago

qwertier24 commented 5 years ago

Hi! I am really interested in this fascinating work. However, I have some questions about the training methods for the transformer model.

In the paper you mention that the transformer model is trained with a learning rate of 6e-4, but you do not say which learning rate decay method you used, which I am curious about. I am also curious about the number of layers in the encoder and decoder.

Could you please describe the training settings in more detail? It would also be much easier for someone like me who wants to reproduce your results if you could publish your training source code.

Thank you very much!

ischlag commented 5 years ago

I don't think they use learning rate decay; at least it is not mentioned anywhere. They do train on all tasks simultaneously (according to the first author) for 500k steps with the attention-is-all-you-need Transformer architecture: a 6-layer encoder and decoder with a hidden size of 512 and a dense filter size of 2048. The batch size is 1024, so you will need some serious compute in order to reproduce this. With this config trained on 4 V100 GPUs, you can do 50k steps in ~13h.
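For reference, here is that configuration collected in one place (a plain-Python summary of the numbers mentioned in the paper and in this thread; the key names are illustrative, not tied to any framework):

```python
# Hyperparameters as described in the paper / this thread (illustrative names).
TRANSFORMER_CONFIG = {
    "num_encoder_layers": 6,
    "num_decoder_layers": 6,
    "hidden_size": 512,        # d_model
    "filter_size": 2048,       # feed-forward ("dense filter") size
    "batch_size": 1024,        # sequences per step, across all GPUs
    "train_steps": 500_000,
    "learning_rate": 6e-4,     # constant, no decay
    "adam_beta1": 0.9,
    "adam_beta2": 0.995,
}
```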

They used the tensor2tensor implementation of the transformer. So technically the code is public. Good luck with that.

Have you had any success @mayukuner?

davidsaxton commented 5 years ago

That's correct: no learning rate decay for the results reported in the paper. 6 layers in the decoder and encoder.

qwertier24 commented 5 years ago

@ischlag Not even close to success. I used transformer_base_v1 as the base parameter set and modified it slightly by adding a constant LR schedule, a warmup procedure, and a few other things, like this:

PROBLEM=algorithmic_math_deepmind_all
MODEL=transformer
HPARAMS=transformer_base_v1

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --worker_gpu=8 \
  --hparams='max_length=64,num_hidden_layers=6,learning_rate=6e-4,learning_rate_schedule=constant*linear_warmup,learning_rate_constant=6e-4,clip_grad_norm=0.1,optimizer_adam_beta2=0.995,batch_size=8192'

Here I am using a batch size of 8192 because 8 GTX 1080 Ti GPUs are used, and 1024 sentences contain approximately 8192 * 8 tokens, so I think this should not be a problem.

And the result is: [screenshot of training/eval metrics]

By the way, I changed the dataset generator a little by randomly selecting training data from train-easy, train-medium, and train-hard so that the dataset's size is approximately 2M. The validation set is sampled from the interpolation set. The code can be found here: https://gist.github.com/mayukuner/dd7f88c0309cc926f1b02cf596b010c4
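For anyone working from the pre-generated data release instead of the T2T generator, roughly the same mixing can be sketched as below. This assumes the released layout of train-easy/, train-medium/ and train-hard/ directories containing .txt files with questions and answers on alternating lines, and it is not the code in the gist:

```python
import random
from pathlib import Path

def sample_pairs(data_root, num_pairs, seed=0):
    """Randomly sample (question, answer) pairs across the three training splits.

    Assumes data_root/train-{easy,medium,hard}/*.txt with questions and answers
    on alternating lines; for the full dataset you would stream instead of
    loading everything into memory.
    """
    rng = random.Random(seed)
    pairs = []
    for split in ("train-easy", "train-medium", "train-hard"):
        for path in sorted(Path(data_root, split).glob("*.txt")):
            lines = path.read_text().splitlines()
            pairs.extend(zip(lines[0::2], lines[1::2]))  # (question, answer)
    rng.shuffle(pairs)
    return pairs[:num_pairs]

# e.g. train_pairs = sample_pairs("mathematics_dataset-v1.0", num_pairs=2_000_000)
```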

@davidsaxton Did you use curriculum training? I don't think you did, but I really can't figure out why I cannot reproduce your results. Am I missing something here?

ischlag commented 5 years ago

I'd highly recommend not deviating from the hyperparameters given in the paper. The Transformer architecture is rather sensitive to them. Remove your schedule and set the batch_size to 1024, then train for 500k steps. Make sure your accuracy is 1 only when all output tokens are correct and 0 if even one token is wrong (no per-symbol accuracy is reported).
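To make that metric concrete, here is a minimal sketch of per-sequence (exact-match) accuracy, assuming integer token-id tensors of shape [batch, length] and a padding id of 0:

```python
import torch

def per_sequence_accuracy(predictions, targets, pad_id=0):
    """1 for a sequence only if every non-padding token matches, else 0."""
    mask = targets != pad_id                      # ignore padding positions
    correct = (predictions == targets) | ~mask    # padding positions count as correct
    exact_match = correct.all(dim=1).float()      # [batch] of 0.0 / 1.0
    return exact_match.mean().item()
```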

qwertier24 commented 5 years ago

@ischlag I am trying my best to get close to the settings in the paper.

As you can see, the batch_size here is the maximum number of tokens per batch per GPU, so overall each batch contains 8192*8 tokens, which is close to 1024 sentences per batch.
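Spelled out, the arithmetic behind that (assuming roughly 64 tokens per sequence, as set by max_length in the command above):

```python
tokens_per_gpu = 8192
num_gpus = 8
max_length = 64  # tokens per sequence, from the t2t command above

tokens_per_step = tokens_per_gpu * num_gpus          # 65,536 tokens
sequences_per_step = tokens_per_step // max_length   # ~1,024 sequences
print(sequences_per_step)  # 1024
```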

Plus, I did not use any schedule except for the warmup; the learning rate curve is as follows: [plot of the learning rate curve]

Also, the reported accuracy_per_sequence is exactly the criterion the paper states for this dataset.

So I guess I am not doing anything wrong here, right?

ischlag commented 5 years ago

Well, I'm not sure how I'm supposed to "see" that. If you are certain that batch_size is actually the number of tokens per GPU instead of the number of samples used for one step, then so be it.

"I changed the dataset generator a little by randomly selecting training data"

Are you sure this is not going to skip data? The tf.data pipeline might do some caching and only go through the generator once. Unfortunately, it is virtually impossible for me to tell by looking at the t2t code.

qwertier24 commented 5 years ago

@ischlag Sorry, I did not explain it well because I thought you were familiar with T2T. The generator in T2T generates 2 million (question, answer) pairs per module, so I had to change it to make it generate 2M samples in total. The data generation procedure has been verified.

ischlag commented 5 years ago

I'm somewhat familiar with it, but I decided not to use it due to its obscurity. I'm just trying to help you here. We are working on reproducing it ourselves with a clean PyTorch implementation, and I'll post the results once we have them.

That said, you should not have 2M samples in total but n * 2M, where n is the number of modules (I think 56 or so). Furthermore, it is not clear to me how you encode the characters. Your gist file says text_problems.VocabType.CHARACTER, which indicates byte-level encoding. I might be wrong on that though.

If that also doesn't help, then I'm out of ideas. As a dummy experiment, you could train only on numbers__place_value, which in my case takes ca. 3-5k steps to reach virtually 100% accuracy.

qwertier24 commented 5 years ago

@ischlag You are right, I missed "per module" in the paper. So I guess 2M * 56 sequences should be used for training. Thank you for your help, and I look forward to your results!

ischlag commented 5 years ago

@mayukuner I'm currently training 3 baselines with my PyTorch implementation. The best result so far is 50% accuracy on all interpolation data after 45k steps, and it is still improving, so this is starting to look promising. However, this is with a learning rate of 1e-4, not 6e-4. The 6e-4 run is stuck at a loss of 3.15 and 0% train accuracy even after 50k steps.

@davidsaxton Are you sure your learning rate in the paper is 6e-4 and not 6e-5?

qwertier24 commented 5 years ago

@ischlag Have you clipped the gradients? You may also try using warmup at the beginning of training. The LR of 6e-4 seems OK to me; with tensor2tensor, the model can be trained to an accuracy of 70% on the interpolation test after 300k steps.
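For a constant learning rate with warmup in PyTorch, a minimal sketch (the warmup length is an assumption, since the paper only gives the base learning rate):

```python
import torch

def constant_with_warmup(optimizer, warmup_steps=8000):
    """Linear warmup from 0 to the optimizer's base LR, then hold it constant.

    warmup_steps is an assumed value, not taken from the paper.
    """
    def lr_lambda(step):
        return min(1.0, (step + 1) / warmup_steps)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# optimizer = torch.optim.Adam(model.parameters(), lr=6e-4, betas=(0.9, 0.995))
# scheduler = constant_with_warmup(optimizer)
# ... call scheduler.step() once per training step
```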

ischlag commented 5 years ago

Yes, I'm clipping the gradient norm of the parameters at 0.1. 6e-4 doesn't work at all. Even 3e-4 doesn't work at all. I've gone through my implementation very carefully several times.

My parameters are initialized from U[-a, a] with a = sqrt(6 / in_out_avg). I share the embedding matrix with the last layer before the softmax. I scale the embedding by sqrt(d_model) and the dot products by 1/sqrt(d_k). Adam with beta1 = 0.9, beta2 = 0.995, and the default epsilon. I scale the embedding just like in the official Transformer code, but I'm not sure why it is sqrt(d_model); the 1/sqrt(d_k) for the keys makes sense though. @mayukuner are you doing the same?
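Roughly, those choices look like this in PyTorch (a sketch only, not the actual implementation; the vocabulary size is an assumed byte/character-level value):

```python
import math
import torch
import torch.nn as nn

d_model = 512
vocab_size = 256  # assumed byte/character-level vocabulary

def init_uniform_(weight):
    """U[-a, a] with a = sqrt(6 / mean(fan_in, fan_out)), as described above."""
    fan_out, fan_in = weight.shape
    a = math.sqrt(6.0 / ((fan_in + fan_out) / 2.0))
    nn.init.uniform_(weight, -a, a)

embedding = nn.Embedding(vocab_size, d_model)
init_uniform_(embedding.weight)
output_proj = nn.Linear(d_model, vocab_size, bias=False)
output_proj.weight = embedding.weight  # tie embedding with the pre-softmax projection

def embed(token_ids):
    # embeddings scaled by sqrt(d_model), as in the original Transformer;
    # dot products inside the attention layers are scaled by 1 / sqrt(d_k)
    return embedding(token_ids) * math.sqrt(d_model)

# optimizer = torch.optim.Adam(model.parameters(), lr=6e-4,
#                              betas=(0.9, 0.995))  # default epsilon
```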

I'm still training and I'm now at 60% interpolation accuracy after 120k steps. So it looks good, just not with the right learning rate for me.

davidsaxton commented 5 years ago

@ischlag I clipped the gradient absolute value, not the norm (i.e., |g_i| <= 0.1 for every gradient index i).
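In PyTorch terms, that corresponds to clipping gradients by value rather than by norm; a minimal sketch (the 0.1 threshold comes from the discussion above, and the tiny model is just a stand-in):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the Transformer
loss = model(torch.randn(4, 8)).sum()
loss.backward()

# Clip each gradient element to [-0.1, 0.1] (absolute-value clipping), rather
# than rescaling so the global gradient norm is at most 0.1:
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.1)
# not: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
```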

chauhanjatin10 commented 4 years ago

Hi @mayukuner, can you share your implementation with me?