jiasenlu / AdaptiveAttention

Implementation of "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning"
https://arxiv.org/abs/1612.01887

The performance of the released model is drastically better than the performance reported in the paper #4

Closed: se4u closed this issue 7 years ago

se4u commented 7 years ago

Hi Jiasen,

The performance of your released model (coco_train.t7, cocotalk_vocab.json) seems to be much better than the performance reported in the highlighted row of Table 1 in your paper (screenshot attached). I feel that I must be misunderstanding something about the code/models.

My understanding is that the following models were trained on the standard Karpathy splits of the MSCOCO captions, and that model (1) was used to generate the lowermost results in Table 1.

URL=https://filebox.ece.vt.edu/~jiasenlu/codeRelease/AdaptiveAttention
wget $URL/model/COCO/coco_train/coco_train.t7 # (1)
wget $URL/data/COCO/cocotalk_vocab.json -O coco_vocab.json 
wget $URL/model/COCO/coco_challenge/model_id1_34.t7 -O coco_challenge_model_id1_34.t7 # (2)
wget $URL/data/COCO/cocotalk_challenge_vocab.json -O coco_challenge_vocab.json

However, when I test the predictions of these models on the test portion of the Karpathy splits, all the metrics come out much higher than the ones reported in Table 1 of the paper. Do you have any idea why the eval metrics might be so much better than the ones reported in the paper? I have validated my own evaluation code by reproducing the results in the LRCN paper, so I am fairly sure it is correct.

| | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|---|
| In paper | 0.742 | 0.580 | 0.439 | 0.332 | 1.085 | 0.266 | 0.549 |
| My eval (1) | 0.794 | 0.647 | 0.513 | 0.403 | 1.287 | 0.293 | 0.595 |
| My eval (2) | 0.782 | 0.628 | 0.485 | 0.368 | 1.219 | 0.285 | 0.580 |

(screenshot of Table 1 from the paper)

se4u commented 7 years ago

Another difference I found in the code is that the global feature cnn_fc (called a^g in the paper) is computed using the final fully connected layer of ResNet, in the function net_utils.build_residual_cnn_fc(cnn, opt) in `misc/net_utils.lua`, instead of by simple averaging as shown in equation (14) of the paper.
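
In rough numpy terms, my understanding of the difference is something like this (just an illustrative sketch; the shapes and names are my assumptions, not the actual tensors in the repo):

```python
import numpy as np

# Illustrative only: 7x7 grid of ResNet spatial features a_1 ... a_k, each d-dimensional.
k, d = 49, 2048
A = np.random.randn(k, d)

# Eq. (14) in the paper: the global image feature a^g is a plain average of the
# spatial features.
a_g_paper = A.mean(axis=0)              # shape (d,)

# What build_residual_cnn_fc appears to do instead: take the output of ResNet's final
# fully connected layer, i.e. roughly an extra learned projection on top of the
# pooled features.
W_fc = np.random.randn(d, d)            # stand-in for the fc layer's weights
b_fc = np.zeros(d)
a_g_code = A.mean(axis=0) @ W_fc + b_fc
```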

Do you know if this is the reason for the performance boost?

danieljf24 commented 7 years ago

Hi @se4u,
I guess the released model was not trained on the standard Karpathy splits; maybe it was trained on the whole train+val of MSCOCO, which could explain the drastically better performance. Additionally, could you please share your evaluation code with me? Thanks in advance.

se4u commented 7 years ago

@danieljf24 Hi, I have attached the output of the model (coco_train.t7, cocotalk_vocab.json) on the test portion of the standard Karpathy splits to this comment. I think it will be easier to independently verify the performance of the model using its output. jiasen_b5.zip

But I have also uploaded my mildly modified copy of the standard eval code to https://github.com/se4u/coco-caption.git

The main file I use is called myeval.py; it just takes two files: the JSON file that I attached and a COCO-caption dataset file. https://github.com/se4u/coco-caption/blob/master/myeval.py
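
For reference, the core of the evaluation is just the standard coco-caption API, roughly like the sketch below (the file paths and the prediction file name are placeholders, not the exact contents of myeval.py):

```python
# Minimal sketch of how coco-caption computes these metrics.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO('annotations/captions_val2014.json')    # ground-truth captions (placeholder path)
cocoRes = coco.loadRes('jiasen_b5.json')            # model predictions (placeholder name)

cocoEval = COCOEvalCap(coco, cocoRes)
cocoEval.params['image_id'] = cocoRes.getImgIds()   # evaluate only on the captioned images
cocoEval.evaluate()

for metric, score in cocoEval.eval.items():
    print('%s: %.3f' % (metric, score))
```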

It will be great if you could also independently verify my finding.

danieljf24 commented 7 years ago

Hi @se4u , I used the pre-trained model coco_train.t7 to generate the prediction result, and the performance is also much higher than that reported in the paper. I am also confused.

performance with beam size of 3

| BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|
| 0.798 | 0.651 | 0.515 | 0.402 | 1.289 | 0.294 | 0.596 |

se4u commented 7 years ago

@danieljf24 thanks for reporting back. In the meantime, I also tried to train the model using the hyperparameters/training script in this repo, and the performance is about 0.05 lower on almost all metrics, i.e.:

| | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|---|
| In paper | 0.742 | 0.580 | 0.439 | 0.332 | 1.085 | 0.266 | 0.549 |
| My train from scratch | 0.703 | 0.532 | 0.398 | 0.299 | 0.922 | 0.244 | 0.521 |

Have you tried to train the model from scratch? Were you able to reproduce the reported numbers in the paper?

danieljf24 commented 7 years ago

@se4u Did you finetune the ResNet? The following are my results training from scratch without finetuning the ResNet:

| BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|
| 0.693 | 0.520 | 0.385 | 0.288 | 0.867 | 0.236 | 0.509 |

se4u commented 7 years ago

Yes, the results I showed are after finetuning the ResNet. I also trained the model without finetuning; those results are very similar to yours:

| | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|---|
| My train, no finetuning | 0.695 | 0.522 | 0.386 | 0.287 | 0.879 | 0.237 | 0.511 |

Could you also try finetuning the ResNet? Maybe you will be able to reproduce the result from the paper.

JaneLou commented 7 years ago

I guess the released model, whose performance is much better than the result reported in the paper, is an ensemble of multiple models.

se4u commented 7 years ago

@JaneLou The released model parameters are loaded into a single model and inference is done without any averaging/voting, so it is not an ensemble.

JaneLou commented 7 years ago

@se4u all right! Have you reproduced the result reported in the paper? The result from the model I trained myself is similar to what you showed above.

se4u commented 7 years ago

@JaneLou my guess is that the released model was trained on both the train and val portions; however, this doesn't explain why there are two released model versions. I think the next step is to use these models to caption the test portion of the MSCOCO dataset and compare the true performance on the COCO server against the reported performance. I will do that and report back here with the result.

@jiasenlu It would be great if you could chime in to clarify :)

jiasenlu commented 7 years ago

Hi @se4u @JaneLou. Sorry for the late reply; I've been busy with something else and didn't check GitHub. Yes, @se4u is right: I think I put the challenge model under the coco_train folder. I just uploaded another model and also tested it on my side. The result is:

DataLoader loading json file: /data/coco/cocotalk.json
vocab size is 9567
DataLoader loading h5 file: /data/coco/cocotalk.h5
read 123287 images of size 3x256x256
max sequence length in data is 16
assigned 113287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test
save/model_id1_36.t7
rnn_size: 512 num_layers: 1 input_encoding_size: 512
dropout rate: 0.5
total number of parameters in LM: 17422177
total number of parameters in CNN_conv: 57992704
constructing clones inside the LanguageModel
=> evaluating ...
[=================== 500/500 =================>] Tot: 2m1s | Step: 235ms
./misc/call_python_caption_eval.sh val1.json annotations/coco.json
File "myeval.py", line 24 print 'using %d/%d predictions' % (len(preds_filt), len(preds)) ^ SyntaxError: invalid syntax { Bleu_1 : 0.742 ROUGE_L : 0.549 SPICE : 0.194 METEOR : 0.266 Bleu_4 : 0.332 Bleu_3 : 0.439 Bleu_2 : 0.58 CIDEr : 1.085 }

Maybe you can take a look again?

jiasenlu commented 7 years ago

@se4u I think the numbers without finetuning look good to me: CIDEr around 0.9. When you finetune the CNN, please set the learning rate a little smaller (I think in my case I set it to 1e-4).

se4u commented 7 years ago

@jiasenlu Thanks a lot for the update, this clears up a lot of the issues:

assigned 113287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test
  1. The above snippet shows that you are using 5k + 5k for val and test and everything else for training. This certainly makes sense on its own. However, I was under the impression that the Karpathy splits meant that only the ~82k images in the original training set should be used for training. I got this impression from the LRCN code, specifically the following file on Jeff Donahue's recurrent branch, which downloads a tar.gz file that splits the dataset into 82k/5k/5k. Adding another ~30k images to the training set should definitely boost the performance (see the arithmetic sketch below), so now I feel fairly confident of being able to reproduce the results.

  2. The max sequence length in the above log is smaller than the one in the paper: 18 in the paper vs. 16 in the log. It's actually impressive that even after truncating ~8% of the captions the performance is still so high. I guess this hyperparameter is not that important as long as it's set to a reasonable value, but knowing that you can set it lower, so that training is a little faster, and still get a high score is useful.
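
For point 1, the split arithmetic I have in mind is roughly the following (the COCO 2014 image counts are my assumption from the official dataset release, not something printed by this repo):

```python
# Rough arithmetic behind the split sizes in the log above.
# Assumed COCO 2014 counts: train2014 = 82,783 images, val2014 = 40,504 images.
train2014 = 82783
val2014 = 40504

karpathy_val = 5000
karpathy_test = 5000
restval = val2014 - karpathy_val - karpathy_test   # 30,504 val2014 images left over

lrcn_style_train = train2014                       # 82,783: only the original train set
neuraltalk2_train = train2014 + restval            # 113,287: matches "assigned 113287 images to split train"

print(lrcn_style_train, neuraltalk2_train)         # 82783 113287
```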

se4u commented 7 years ago

@jiasenlu Finally, regarding finetuning of the CNN: thanks for the tip. I did see that you were using two different learning rates for the LSTM part and the CNN; the CNN's learning rate is currently set to 1e-5, while the LSTM's learning rate is 4e-4 (5e-4 in the paper).

If I understand correctly, you are saying that while finetuning the CNN you keep the CNN's lr low, so that its parameters don't change too much, and lower the LSTM's lr as well, to 1e-4?

jiasenlu commented 7 years ago

> The above snippet shows that you are using 5k + 5k for val and test and everything else for training. This certainly makes sense on its own. However, I was under the impression that the Karpathy splits meant that only the ~82k images in the original training set should be used for training. I got this impression from the LRCN code, specifically the following file on Jeff Donahue's recurrent branch, which downloads a tar.gz file that splits the dataset into 82k/5k/5k. Adding another ~30k images to the training set should definitely boost the performance, so now I feel fairly confident of being able to reproduce the results.

I'm not sure what LRCN did, but this is based on the neuraltalk2 split. I think most recent image captioning papers are all based on that split. And the COCO challenge result can also show something, right?

jamiechoi1995 commented 6 years ago

@se4u As for the splits you have discussed: in my opinion, the split in this repo and in neuraltalk2 (a random split that mixes up images from train2014/ and val2014/) is different from Karpathy's split (https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip, where the val and test splits only select images from the val2014/ folder).
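
A quick way to check this concretely is a sketch like the one below (it assumes the dataset_coco.json file inside that caption_datasets.zip, which records a filepath and split for every image):

```python
import json
from collections import Counter

# Assumes dataset_coco.json from Karpathy's caption_datasets.zip linked above.
with open('dataset_coco.json') as f:
    data = json.load(f)

# Count which COCO folder (train2014/ or val2014/) each split draws its images from.
counts = Counter((img['split'], img['filepath']) for img in data['images'])
for (split, folder), n in sorted(counts.items()):
    print(split, folder, n)

# In Karpathy's split, 'val' and 'test' images come only from val2014/,
# while 'train' plus 'restval' together give the 113,287 training images.
```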

But I still don't understand why the pre-trained model provided by the author gets such a high score. Maybe training from scratch can reproduce the result in the paper (CIDEr 1.085), but then how was the pre-trained model trained?

My eval result:

| model | split | CIDEr |
|---|---|---|
| model_id1_36.t7 | random split, test | 1.032 |
| model_id1_34.t7 | random split, test | 1.235 |
| model_id1_36.t7 | Karpathy's split, test | 1.237 |
| model_id1_34.t7 | Karpathy's split, test | 1.219 |

All the models above were evaluated without finetuning.