ruotianluo / ImageCaptioning.pytorch

I decide to sync up this repo and self-critical.pytorch. (The old master is in old master branch for archive)
MIT License

Benchmarks #10

Open ruotianluo opened 7 years ago

ruotianluo commented 7 years ago

Cross-entropy loss (CIDEr score on the validation set, no beam search, 25 epochs):

- fc: 0.92
- att2in: 0.95
- att2in2: 0.99
- topdown: 1.01

Self-critical training (code in https://github.com/ruotianluo/self-critical.pytorch; self-critical started after 25 epochs of cross-entropy; suggestion: don't start self-critical too late):

- att2in: 1.12
- topdown: 1.12

Test split (beam size 5), cross-entropy:

- topdown: CIDEr 1.07

Test split (beam size 5), self-critical:

- topdown: Bleu_1 0.779, Bleu_2 0.615, Bleu_3 0.467, Bleu_4 0.347, METEOR 0.269, ROUGE_L 0.561, CIDEr 1.143
- att2in2: Bleu_1 0.777, Bleu_2 0.613, Bleu_3 0.465, Bleu_4 0.347, METEOR 0.267, ROUGE_L 0.560, CIDEr 1.156
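For reproducing the test-split numbers above, here is a minimal sketch of an evaluation call. The flag names are the ones I believe eval.py exposes via opts.py, and the checkpoint paths are placeholders, not the released model file names:

```bash
# Sketch only: placeholder paths; verify the exact flags against eval.py / opts.py in your checkout.
python eval.py --model log_topdown/model-best.pth \
               --infos_path log_topdown/infos_topdown-best.pkl \
               --language_eval 1 \
               --beam_size 5 \
               --dump_images 0
```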

SJTUzhanglj commented 7 years ago

Is there any code or options showing how to train any of these models (topdown, etc.) with the self-critical algorithm? @ruotianluo

ruotianluo commented 7 years ago

It's in my other repository: https://github.com/ruotianluo/self-critical.pytorch

miracle24 commented 7 years ago

Did you fine-tune the CNN when training the model with cross-entropy loss?

ruotianluo commented 7 years ago

No.

miracle24 commented 7 years ago

Wow, that's surprising. I cannot achieve such a high score without fine-tuning when training my own captioning model under cross-entropy loss. Most papers I have read fine-tune the CNN when training the model with cross-entropy loss. Are there any tips for training with cross-entropy?

ruotianluo commented 7 years ago

Fine-tuning is actually worse. It's about how the features are extracted; check the Self-critical Sequence Training paper.

miracle24 commented 7 years ago

I think they mean they did not fine-tune when training the model under the RL loss, but they did not mention whether they fine-tuned the CNN when training under cross-entropy loss.

miracle24 commented 7 years ago

I fine-tuned the CNN under cross-entropy loss as in neuraltalk2 (the Lua version) and got a CIDEr of 0.91 on the validation set without beam search. I then trained the self-critical model without fine-tuning, starting from the best pretrained model, and finally got a CIDEr almost matching the result reported in the self-critical paper.

ruotianluo commented 7 years ago

They didn't fine-tune in either phase. And fine-tuning may not work as well for attention-based models.

miracle24 commented 7 years ago

I have not trained the attention-based model, but I will try. Thank you for your code; I will start learning PyTorch with it.

ahkarami commented 7 years ago

Dear @ruotianluo, thank you for your fantastic code. Would you please tell me all of the parameters you used to run train.py? In fact, I used your code as described in the README, but when I tested the trained model I got the same result (i.e., the same caption) for all of my different test images. It is worth noting that I used --language_eval 0; maybe this wrong parameter caused these results, am I correct?

ruotianluo commented 7 years ago

Can you try downloading the pretrained model and evaluating it on your test images? That would help me narrow down the problem.

ahkarami commented 7 years ago

Yes, I can download the pre-trained models and use them. The results from the pre-trained models were appropriate and nice; however, my own trained models produced the same output for all of the images. It seems something was wrong with the parameters I used for training, and the trained model produced the same caption for every given image.

ruotianluo commented 7 years ago

You should be able to reproduce my result by following my instructions; it's really weird. Anyway, which options are not clear to you? (Most of the options are explained in opts.py.)

ahkarami commented 7 years ago

Thank you very much for your help. The problem has been solved. In fact, I had trained your code on another synthetic dataset, and that is where the error occurred; when I used your code on the MS-COCO dataset, the training process had no problem. Just as another question, would you please kindly tell me appropriate values for the training parameters? I mean appropriate values for parameters such as beam_size, rnn_size, num_layers, rnn_type, learning_rate, learning_rate_decay_every, and scheduled_sampling_start.

ruotianluo commented 7 years ago

@ahkarami Was the previous problem related to my code? I think the appropriate values vary from dataset to dataset. Beam size could be 5; the values I set are the same as in the README.
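For concreteness, a hedged example of an FC-model training command along the lines of the README defaults (the values here are illustrative and should be checked against opts.py and the README rather than taken as the exact released settings):

```bash
# Sketch: roughly a README-style invocation; verify each flag and default in opts.py before relying on it.
python train.py --id fc --caption_model fc \
                --input_json data/cocotalk.json \
                --input_fc_dir data/cocotalk_fc --input_att_dir data/cocotalk_att \
                --input_label_h5 data/cocotalk_label.h5 \
                --batch_size 10 --learning_rate 5e-4 --learning_rate_decay_start 0 \
                --scheduled_sampling_start 0 \
                --checkpoint_path log_fc --save_checkpoint_every 6000 \
                --val_images_use 5000 --max_epochs 30
```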

ahkarami commented 7 years ago

Dear @ruotianluo, no, the previous problem was related to my dataset; your code is correct. In fact, my dataset contains many repeated words. Moreover, the length of the sentences varies from ~15 up to 90 words. I changed the parameters of prepro_labels.py to --max_length 50 and --word_count_threshold 2; after about 40 epochs, the produced results are no longer identical for every given image. However, the results were still bad and not appropriate. I think my parameters for training and for pre-processing the labels are still not appropriate.
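As a reference point, a hedged sketch of a prepro_labels.py call with those two settings. Only --max_length and --word_count_threshold come from the comment above; the script path and the dataset input/output names are placeholders based on the usual repo layout:

```bash
# Sketch: placeholder dataset paths; --max_length / --word_count_threshold as described above.
python scripts/prepro_labels.py --input_json data/dataset_mine.json \
                                --output_json data/mytalk.json \
                                --output_h5 data/mytalk \
                                --max_length 50 \
                                --word_count_threshold 2
```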

xyy19920105 commented 6 years ago

Hi @ruotianluo, thank you for your code and benchmark. Did you test adaptive attention with your code? Could you post the adaptive attention results? Thank you again.

ruotianluo commented 6 years ago

Actually no. I didn't spend much time on that model.

xyy19920105 commented 6 years ago

Thanks for your reply. Do you think the adaptive attention model is not good enough as a baseline?

ruotianluo commented 6 years ago

It's good; I just couldn't get it to work well.

dmitriy-serdyuk commented 6 years ago

Could you clarify which features are used for the results above? ResNet-152? And does fc stand for ShowTell?

ruotianluo commented 6 years ago

@dmitriy-serdyuk It uses ResNet-101, and FC stands for the FC model in the Self-critical Sequence Training paper, which can be regarded as a variant of ShowTell.

chynphh commented 6 years ago

Thank you for your fantastic code. I am a beginner, and it has helped me a lot. I have a question about the LSTMCore class in FCModel.py. Why don't you use the official LSTM module and run it step by step, or the LSTMCell module with a dropout layer on top? Is there any difference between your code and those?

ruotianluo commented 6 years ago

The in gate is different. https://github.com/ruotianluo/ImageCaptioning.pytorch/blob/master/models/FCModel.py#L34
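To illustrate what a modified input transformation can look like, here is a minimal sketch of an LSTM step whose cell-input candidate is a maxout over two linear projections instead of the standard tanh, which is my reading of the linked line. The class and variable names are illustrative; this is not the repo's exact LSTMCore code:

```python
import torch
import torch.nn as nn

class MaxoutLSTMCell(nn.Module):
    """Illustrative LSTM step: standard input/forget/output gates, but the
    cell-input candidate is a maxout of two projections instead of tanh."""

    def __init__(self, input_size, rnn_size):
        super().__init__()
        # 5 chunks: input gate, forget gate, output gate, and two maxout candidates
        self.i2h = nn.Linear(input_size, 5 * rnn_size)
        self.h2h = nn.Linear(rnn_size, 5 * rnn_size)
        self.rnn_size = rnn_size

    def forward(self, x, state):
        h_prev, c_prev = state
        sums = self.i2h(x) + self.h2h(h_prev)
        i, f, o, g1, g2 = sums.split(self.rnn_size, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.max(g1, g2)              # maxout candidate in place of tanh(g)
        c = f * c_prev + i * g
        h = o * torch.tanh(c)
        return h, (h, c)

# Usage sketch:
# cell = MaxoutLSTMCell(input_size=512, rnn_size=512)
# h = c = torch.zeros(10, 512)
# out, (h, c) = cell(torch.randn(10, 512), (h, c))
```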

chynphh commented 6 years ago

OK, I got it. But why did you make this change? Is there any paper or research about it?

ruotianluo commented 6 years ago

Self-critical Sequence Training for Image Captioning https://arxiv.org/abs/1612.00563

chynphh commented 6 years ago

Thank you very much!

eriche2016 commented 6 years ago

I am wondering whether you used only the 80K training set to get such high performance on the validation set, or the 110K set. I am running experiments on the Karpathy split with the 80K set, and I only get a CIDEr of 0.72 when using the train set alone. If you used only the 80K set, can you give me some tips on training the network?

eriche2016 commented 6 years ago

BTW, I am using the Show, Attend and Tell model for my experiment.

ruotianluo commented 6 years ago

@eriche2016 I use 110k.

eriche2016 commented 6 years ago

Okay, I got it. Thank you very much for your quick reply.

jamiechoi1995 commented 6 years ago

I used the att2in2 pre-trained model with ResNet-101 CNN features, and the evaluation result is:

Bleu_1: 0.752, Bleu_2: 0.588, Bleu_3: 0.448, Bleu_4: 0.339, METEOR: 0.264, ROUGE_L: 0.551, CIDEr: 1.058, loss: 12.945

{'CIDEr': 1.0579511410971039, 'Bleu_4': 0.33850444932429163, 'Bleu_3': 0.4475539789958938, 'Bleu_2': 0.588021344462357, 'Bleu_1': 0.7524049671248727, 'ROUGE_L': 0.5509140488261475, 'METEOR': 0.2637079091201445}

I am confused about the loss; it seems too high.

ruotianluo commented 6 years ago

@jamiechoi1995 That's the cross-entropy loss; that's expected.

jamiechoi1995 commented 6 years ago

@ruotianluo So the pre-trained models include self-critical training? I thought they only included MLE training, sorry.

ruotianluo commented 6 years ago

They do, but those are in other folders.

miracle24 commented 6 years ago

Hi. Can you give more details about how you ran att2in2 with self-critical training? For example, how many epochs did you pretrain att2in2 with the XE loss, and how many epochs did you then train it with self-critical? If possible, could you provide the training script? Thanks a lot.

ruotianluo commented 6 years ago

Check out https://github.com/ruotianluo/self-critical.pytorch

miracle24 commented 6 years ago

I have read that. `python train.py --id fc_rl --caption_model fc --input_json data/cocotalk.json --input_fc_dir data/cocotalk_fc --input_att_dir data/cocotalk_att --input_label_h5 data/cocotalk_label.h5 --batch_size 10 --learning_rate 5e-5 --start_from log_fc_rl --checkpoint_path log_fc_rl --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --self_critical_after 30`. But how many epochs did you train the model with self-critical?

ruotianluo commented 6 years ago

I see. You can actually train for as long as you want. I think I trained for an additional 30 epochs.
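In terms of the command quoted above, an additional 30 self-critical epochs after --self_critical_after 30 would roughly correspond to capping training around epoch 60. A hedged sketch, assuming opts.py exposes a --max_epochs flag; the ellipsis stands for the other flags already shown above:

```bash
# Sketch: same flags as the command above (elided), with an assumed --max_epochs cap,
# giving roughly 30 more epochs of self-critical training after epoch 30.
python train.py --id fc_rl --caption_model fc ... --self_critical_after 30 --max_epochs 60
```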

miracle24 commented 6 years ago

Ok, I see. Thanks a lot.

upccpu commented 6 years ago

I implemented the ideas from my paper on top of your code, with ResNet-152 and cross-entropy loss, without ensembling. The evaluation result is: Bleu_1: 0.759, Bleu_2: 0.595, Bleu_3: 0.454, Bleu_4: 0.344, METEOR: 0.268, ROUGE_L: 0.556, CIDEr: 1.090. The results exceed the top-down model by a large margin in the same environment, especially the CIDEr score (1.090 >> 1.051). It is beyond my expectation.

mojesty commented 6 years ago

Hello! I have some questions about the pretrained models' performance. I tested the top-down, FC, and att2in models on several random images from the Internet and found that they cannot describe the images correctly (although the top-down and att2in models produced syntactically correct sentences, e.g. "a woman sitting on a chair with a dog"). I also visualized the attention maps, and they look more or less random for every model as well. So either my method of testing the models is flawed, or the models themselves are not that good; I would like to discuss this. Also, @upccpu, could you please provide me with your trained model?

YuanEZhou commented 5 years ago

Configuration:

- opt.id = 'topdown'
- opt.caption_model = 'topdown'
- opt.rnn_size = 1000
- opt.input_encoding_size = 1000
- opt.batch_size = 100

Other configurations follow this repository.

Cross-entropy loss: [attached results image: ce_wo_constrain]

Cross-entropy + self-critical: slightly better than the result reported in the original paper. [attached results images: ce, sc, argmax]
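For reference, a hedged command-line equivalent of that configuration (flag names taken from opts.py; everything not listed is left at the repository defaults):

```bash
# Sketch: the opt.* settings above expressed as train.py flags; other options at their defaults.
python train.py --id topdown --caption_model topdown \
                --rnn_size 1000 --input_encoding_size 1000 --batch_size 100
```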

jamiechoi1995 commented 5 years ago

> opt.id = 'topdown', opt.caption_model = 'topdown', opt.rnn_size = 1000, opt.input_encoding_size = 1000, opt.batch_size = 100; other configurations follow this repository. Cross-entropy and cross-entropy + self-critical results as posted above.

@YuanEZhou which features did you use, the default ResNet-101 features or the bottom-up features?

YuanEZhou commented 5 years ago

bottom up feature

jamiechoi1995 commented 5 years ago

> bottom up feature

@YuanEZhou may I ask how you used these features? I have a similar question in this issue: https://github.com/ruotianluo/self-critical.pytorch/issues/66

Did you modify the code to incorporate the bounding-box information, or just use the default options?

YuanEZhou commented 5 years ago

@jamiechoi1995 I use the default options.
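For readers wondering what "default options" means in practice here, a hedged sketch of pointing the data loader at pre-extracted bottom-up features simply by swapping the feature directories. The cocobu_* directory names follow the self-critical.pytorch README convention and are an assumption; verify them against your own preprocessing output:

```bash
# Sketch: same topdown invocation, but the fc/att feature directories point at the
# preprocessed bottom-up-attention features instead of the ResNet ones (assumed paths).
python train.py --id topdown_bu --caption_model topdown \
                --input_json data/cocotalk.json \
                --input_fc_dir data/cocobu_fc \
                --input_att_dir data/cocobu_att \
                --input_label_h5 data/cocotalk_label.h5
```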

jamiechoi1995 commented 5 years ago

Adaptive Attention model, learning rate 1e-4, batch size 32, trained for 100 epochs. I used the code in the self-critical repo.

{'CIDEr': 1.0295328576254532, 'Bleu_4': 0.32367107232015596, 'Bleu_3': 0.4308636494026319, 'Bleu_2': 0.5710839754137301, 'Bleu_1': 0.7375622419883233, 'ROUGE_L': 0.5415854013591195, 'METEOR': 0.2603669044858015, 'SPICE': 0.19360318734522747}

fawazsammani commented 5 years ago

@YuanEZhou can you please share the results.json file you got from the coco-caption code, which includes all the image IDs with their predictions for the validation images? I urgently need it. Your help is highly appreciated.