ruotianluo / self-critical.pytorch

Unofficial PyTorch implementation of Self-critical Sequence Training for Image Captioning, and others.
MIT License

transformer should achieve around 1.29 #196

Open homelifes opened 4 years ago

homelifes commented 4 years ago

Hello @ruotianluo, and thanks for your code. I've seen that some papers report results of around 1.285-1.29 for a pure transformer applied to image captioning. For example, in the paper X-Linear Attention Networks for Image Captioning, the authors report 128.3 CIDEr (i.e., 1.283) after self-critical training, and in Multimodal Transformer with Multi-View Visual Representation for Image Captioning, they report 1.292. But your code achieves only 1.266. Do you have any clue what the problem may be? Note that they use the Post-LN version (meaning that layer norm is applied after the residual output and not at the input, which is the original form proposed in the Attention Is All You Need paper), but I guess that is not the reason they get better scores. Please share any suggestions you have.
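For context, a minimal sketch of the two residual/layer-norm orderings being compared (Post-LN as in the original paper vs. the Pre-LN style); the class names here are illustrative and are not the classes used in this repo:

```python
import torch.nn as nn

class PostLNSublayer(nn.Module):
    """Post-LN ordering (original Attention Is All You Need):
    LayerNorm is applied after the residual addition."""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))


class PreLNSublayer(nn.Module):
    """Pre-LN ordering: LayerNorm is applied to the sublayer input,
    leaving the residual path unnormalized."""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))
```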

luo3300612 commented 4 years ago

I have run the transformer in this repo and got 1.2798 CIDEr. As stated in MODEL_ZOO.md, the reported number is not cherry-picked from multiple runs. Maybe the author just ran it once and got 1.266. So it's possible that the transformer in this repo can reach the 128.3 reported in X-Linear Attention Networks for Image Captioning.

ruotianluo commented 4 years ago

Yes, I agree. I ran it with reduce-on-plateau, which limits its performance. I will update it.

fawazsammani commented 4 years ago

I also ran it. For XE I got 1.157, and for self-critical without reduce-on-plateau I got 1.275, which is quite near 1.283. @ruotianluo something you might want to add as well is instance normalization of the query. As reported in the paper Normalized and Geometry-Aware Self-Attention Network for Image Captioning, it can increase the CIDEr score up to 1.31, and it is just one line of code. But since your code uses adaptive features, the computation of the mean and std has to be masked, and therefore nn.InstanceNorm1d is not suitable. I wrote the code yesterday here if you'd like, but I haven't tried it on the transformer yet.
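For illustration, a hedged sketch of where the query normalization could plug into scaled dot-product attention, using the unmasked nn.InstanceNorm1d; as noted above, with adaptive (padded) region features the statistics would need to be masked instead, and none of these names come from the repo or the paper's official code:

```python
import torch
import torch.nn as nn

class QueryNormalizedAttention(nn.Module):
    """Scaled dot-product attention with an instance-normalized query.
    Illustrative only: InstanceNorm1d computes per-channel statistics over
    all positions, including pads, so a masked variant is needed for
    adaptive features."""
    def __init__(self, d_model):
        super().__init__()
        self.q_norm = nn.InstanceNorm1d(d_model, affine=True)
        self.scale = d_model ** -0.5

    def forward(self, q, k, v, mask=None):
        # q, k, v: [B, N, d_model]; InstanceNorm1d expects [B, C, N]
        q = self.q_norm(q.transpose(1, 2)).transpose(1, 2)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        return torch.matmul(scores.softmax(dim=-1), v)
```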

ruotianluo commented 4 years ago

Thank you!

ruotianluo commented 4 years ago

BTW, what schedule did you use to get 1.157?

fawazsammani commented 4 years ago

Same as your configuration: noamopt with 20K warmup steps. I just set the label smoothing to 0.1. Please note that 1.157 is after beam search with size 3, and that was the best checkpoint; the other ones were mostly around 1.146. Another thing I tried is following the paper On Layer Normalization in the Transformer Architecture (training with Adam without warmup), and I could achieve an even better score, around 1.168. But the problem is that it achieves worse results in self-critical training compared to using noamopt. My analysis is that noamopt gives the model more potential for self-critical training.
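For reference, the standard Noam schedule from Attention Is All You Need, which is what noamopt follows; the 20,000-step warmup matches the configuration mentioned above, while d_model = 512 and factor = 1.0 are assumptions in this sketch:

```python
def noam_lr(step, d_model=512, warmup=20000, factor=1.0):
    """Noam schedule: linear warmup for `warmup` steps, then
    inverse-square-root decay (step counts from 1)."""
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```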

ruotianluo commented 4 years ago

Thanks.

fawazsammani commented 4 years ago

One more thing I forgot: I also used weight tying in the decoder, following the original paper.

homelifes commented 4 years ago

Thanks all!

@luo3300612 may I know if you've changed anything or any hyperparameters in the code to achieve this score?

luo3300612 commented 4 years ago

> @luo3300612 may I know if you've changed anything or any hyperparameters in the code to achieve this score?

I changed nothing. I just used the config in configs/transformer.yml.

ruotianluo commented 4 years ago

@homelifes you can try learning rate 5e-6 and no reduce_on_plateau when doing SCST. It can get higher than 1.266. I will update the model zoo soon.

ruotianluo commented 4 years ago

@homelifes You can also try to increase the patience in https://github.com/ruotianluo/self-critical.pytorch/blob/master/tools/train.py#L97, maybe to 10? I haven't tried.

homelifes commented 4 years ago

Actually @ruotianluo I set the patience to 1, meaning that if there is no improvement for one epoch, I decay the learning rate by half (multiply by 0.5). Is that fine?

ruotianluo commented 4 years ago

@homelifes that would be really bad. If the learning rate gets too small, the model will stop updating.

homelifes commented 4 years ago

But actually when the learning rate decayed, it improved for one epoch right away and then stopped improving. I will retry it with patience 10.

fawazsammani commented 4 years ago

Hello @ruotianluo. I've seen that you updated the code and scores; thanks a lot for your work. However, I have a question about the transformer step schedule used in the Normalized Attention paper, which you included in the configs.

The authors mention this:

The base learning rate is set to min(t × 10^-4, 3 × 10^-4), where t is the current epoch number that starts at 1. After 6 epochs, the learning rate is decayed by 1/2 every 3 epochs.

This means that the learning rate from epoch t = 1 till 6 is: 1e-4, 2e-4, 3e-4, 3e-4, 3e-4, 3e-4. It is then reduced by 0.5 every 3 epochs.

However, the way you do it is: opt.current_lr = opt.learning_rate * (iteration+1) / opt.noamopt_warmup, and you start at 3e-4 (according to the .yml file). At epoch 3, you decay it by half, so your learning rate goes (assuming the iteration count per epoch is 11k): 3e-4, 9.9e-5, 1.9e-4, 3e-4, ... and is then decayed by half every 3 epochs.

There is an obvious difference. I actually tried the instance normalization on your code, as discussed in their paper, but it achieves worse results than the original transformer, so I assume the above schedule is necessary.
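To make the difference concrete, a small sketch of the two schedules as I read them; the function names, the ~33k-iteration warmup (about 3 epochs at 11k iterations each), and the exact placement of the halving steps are assumptions, not code from the repo or the paper:

```python
def paper_epoch_lr(epoch):
    """Schedule quoted from the Normalized Attention paper:
    min(t * 1e-4, 3e-4) for epoch t starting at 1, then halved every
    3 epochs after epoch 6 (one possible reading of the decay rule)."""
    lr = min(epoch * 1e-4, 3e-4)
    if epoch > 6:
        lr *= 0.5 ** ((epoch - 6) // 3)
    return lr


def repo_warmup_lr(iteration, base_lr=3e-4, warmup=33000):
    """Linear warmup in the style of
    opt.current_lr = opt.learning_rate * (iteration + 1) / opt.noamopt_warmup,
    capped at the base rate; the step decay every 3 epochs would then be
    applied on top of this."""
    return base_lr * min((iteration + 1) / warmup, 1.0)
```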

ruotianluo commented 4 years ago

@fawazsammani I feel like a difference at this level should not change the results fundamentally. You can try. (I tried the instance norm too; same result here.)

upccpu commented 4 years ago

> @fawazsammani I feel like a difference at this level should not change the results fundamentally. You can try. (I tried the instance norm too; same result here.)

Hi, ruotianluo. In my opinion, the number of training epochs should increase (more than 15; I set 25) if you use a learning rate decay strategy. The model takes longer to converge.

fawazsammani commented 4 years ago

@upccpu no matter how good the improvement in XE is, the SCST results are always similar to those trained with the conventional warmup schedule. I tried twice: once I got 1.152 and once 1.166, but the SCST scores are always around 1.276-1.278.

upccpu commented 4 years ago

Yeah, I agree with you. But I only obtained 1.150 (less than 1.166) with the transformer_step setting (in /Config/transformer/transformer_step), and I can obtain 1.182 when using my own algorithm (same configuration). However, the SCST score is only 1.297. This is not the result I expected. It's confusing.

fawazsammani commented 4 years ago

@upccpu that seems correct. For the transformer, it's usually about 12 points above the XE checkpoint under SCST.

upccpu commented 4 years ago

> @upccpu that seems correct. For the transformer, it's usually about 12 points above the XE checkpoint under SCST.

To be honest, the improvements from SCST depend on algorithmic innovation rather than changes in the learning rate strategy. If you only change the learning rate strategy to achieve a better result in XE, SCST will not benefit from it. However, if you investigate a novel structure to achieve a better result in XE, SCST will benefit from it. Is that right?

fawazsammani commented 4 years ago

@upccpu It makes sense. As I told you in the previous issue, the XE scores increase but the SCST scores are always similar across different training schedules. Ruotianluo also stated the same thing in the model zoo. However, I cannot confirm this, because two papers here and here claimed that they got >=1.288 for the transformer baseline (6 layers).

upccpu commented 4 years ago

@fawazsammani It also confuses me. Maybe they use some tricks. Thanks for your advice.

upccpu commented 4 years ago

Hi @fawazsammani. About the instance normalization: in my opinion, it is not necessary to consider the mask. If a feature has dimensions [B, C, W, H], instance normalization operates mainly over the [W, H] dimensions. For an adaptive feature [B, C, box_num], 'box_num' plays the role of [W, H], so we just need to calculate the mean and std over the 'box_num' dimension. You can see the 'Layer_Norm' function in ruotianluo's code.

fawazsammani commented 4 years ago

@upccpu if you don't mask, the mean and standard deviation are calculated over the pads as well. As a simple example: (1 + 2 + 3 + 0 + 0) / 5 = 1.2, while the answer should actually be 2. Anyway, I tried without masking as well, and it still achieves worse results. I forgot about this instance norm thing a long time ago.

fawazsammani commented 4 years ago

@upccpu by the way, you can have a look at my implementation of the masked instance norm here. You can verify the implementation by running it without masking (the case where there is no mask): it gives the same result as the PyTorch instance norm. Then you can compare the results with and without masking (in the case where there is a mask) and see that they are different.
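Since the implementation is only linked above, here is a hedged sketch of what a masked instance norm over adaptive region features could look like (per-sample, per-channel statistics computed over the unpadded boxes only); it is not necessarily identical to fawazsammani's code:

```python
import torch

def masked_instance_norm(x, mask, eps=1e-5):
    """x: [B, N, C] adaptive region features; mask: [B, N] with 1 for real
    boxes and 0 for padding. Normalizes each channel per sample using the
    mean/variance of the unpadded positions only, then re-zeroes the pads."""
    mask = mask.unsqueeze(-1).type_as(x)               # [B, N, 1]
    n = mask.sum(dim=1, keepdim=True).clamp(min=1.0)   # real boxes per sample
    mean = (x * mask).sum(dim=1, keepdim=True) / n     # [B, 1, C]
    var = ((x - mean) ** 2 * mask).sum(dim=1, keepdim=True) / n
    return (x - mean) / torch.sqrt(var + eps) * mask
```

With an all-ones mask this reduces to plain instance normalization over the box dimension (without affine parameters), which matches the sanity check described above.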

upccpu commented 4 years ago

@fawazsammani Thanks ~.~ I am conducting experiments based on your masked instance norm. Thanks again for your reply; it helps me a lot.