Another difference I found in the code is that the global feature cnn_fc (called a^g in the paper) is computed using the final fully connected layer from ResNet, in the function net_utils.build_residual_cnn_fc(cnn, opt) in `misc/net_utils.lua`, instead of by the simple averaging shown in equation (14) of the paper.
Do you know if this is the reason for the performance boost?
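For reference, my reading of equation (14) is just a simple average of the k spatial features; a minimal PyTorch-style sketch of what I mean (the repo itself is Torch/Lua, and the shapes below are placeholders):

```python
import torch

# Placeholder shapes: A holds the k spatial ResNet features a_1..a_k
# (e.g. a 7x7 grid flattened to k = 49 vectors of dimension d = 2048).
k, d = 49, 2048
A = torch.randn(k, d)

# Equation (14) as I read it: the global feature a^g is a simple
# average of the spatial features ...
a_g = A.mean(dim=0)            # shape (d,)

# ... whereas build_residual_cnn_fc instead takes the activation of
# ResNet's final fully connected layer as the global feature.
print(a_g.shape)               # torch.Size([2048])
```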
Hi @se4u
I guess the released model was not trained using the standard Karpathy splits, but maybe on the whole train and val sets of MSCOCO, which may explain the drastically better performance. Additionally, could you please share the evaluation code with me? Thanks in advance.
@danieljf24 Hi, I have attached to this comment the output of the model (coco_train.t7, cocotalk_vocab.json)
on the test portion of the standard Karpathy splits. I think it will be easier to independently verify the performance of the model using its output.
jiasen_b5.zip
But I have also uploaded my mildly modified copy of the standard eval code to https://github.com/se4u/coco-caption.git
The main file that I use is called myeval.py,
and it just takes two files: the JSON file that I attached and a COCO-caption dataset file.
https://github.com/se4u/coco-caption/blob/master/myeval.py
It will be great if you could also independently verify my finding.
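For what it's worth, myeval.py is essentially a thin wrapper around the standard coco-caption API; the call sequence is roughly the following (file names are placeholders, and the exact script may differ slightly):

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Ground-truth captions and model predictions (placeholder paths: the
# predictions file is the JSON attached above, the annotations file is
# the usual COCO captions JSON).
coco = COCO('annotations/captions_val2014.json')
coco_res = coco.loadRes('captions_predictions.json')

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params['image_id'] = coco_res.getImgIds()  # score only the predicted images
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():
    print('%s: %.3f' % (metric, score))
```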
Hi @se4u, I used the pre-trained model coco_train.t7
to generate the predictions, and the performance is also much higher than what is reported in the paper. I am also confused.
Performance with a beam size of 3:
BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE-L |
---|---|---|---|---|---|---|
0.798 | 0.651 | 0.515 | 0.402 | 1.289 | 0.294 | 0.596 |
@danieljf24 Thanks for reporting back. In the meantime, I also tried to train the model using the hyper-parameters/training script in this repo, and the performance is about 0.05 lower on almost all metrics, i.e.:
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE-L |
---|---|---|---|---|---|---|---|
In paper | 0.742 | 0.580 | 0.439 | 0.332 | 1.085 | 0.266 | 0.549 |
My train from scratch | 0.703 | 0.532 | 0.398 | 0.299 | 0.922 | 0.244 | 0.521 |
Have you tried to train the model from scratch? Were you able to reproduce the reported numbers in the paper?
@se4u Did you finetune the ResNet? The following are my results training from scratch without finetuning the ResNet:
BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE-L |
---|---|---|---|---|---|---|
0.693 | 0.520 | 0.385 | 0.288 | 0.867 | 0.236 | 0.509 |
Yes, the results I showed are after fine-tuning the ResNet. I also trained the model without finetuning; the results from that are very similar to yours:
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE-L |
---|---|---|---|---|---|---|---|
My train (no finetuning) | 0.695 | 0.522 | 0.386 | 0.287 | 0.879 | 0.237 | 0.511 |
Could you also try finetuning the ResNet? Maybe you will be able to reproduce the result from the paper.
I guess the released model, whose performance is much better than the result reported in the paper, is an ensemble of multiple models.
@JaneLou The released model parameters are loaded into a single model and inference is done without any averaging/voting, so it is not an ensemble.
@se4u All right! Have you reproduced the result reported in the paper? The result from the model I trained myself is similar to the one you showed above.
@JaneLou My guess is that the released model was trained on both the train and val portions; however, this doesn't explain why there are two released model versions. I think the next step will be to use these models to caption the test portion of the MSCOCO dataset and to compare the true performance on the COCO evaluation server to the reported performance. I will do that and report back here with the result.
@jiasenlu It will be great if you chime in to clarify :)
Hi @se4u @JaneLou. Sorry for the late reply. I've been busy with something else and didn't check GitHub. Yes, @se4u is right: I think I put the challenge model under the coco_train folder. I just uploaded another model and also tested it on my side. The result is:
DataLoader loading json file: /data/coco/cocotalk.json
vocab size is 9567
DataLoader loading h5 file: /data/coco/cocotalk.h5
read 123287 images of size 3x256x256
max sequence length in data is 16
assigned 113287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test
save/model_id1_36.t7
rnn_size: 512 num_layers: 1
input_encoding_size: 512
dropout rate: 0.5
total number of parameters in LM: 17422177
total number of parameters in CNN_conv: 57992704
constructing clones inside the LanguageModel
=> evaluating ...
[=================== 500/500 =================>] Tot: 2m1s | Step: 235ms
./misc/call_python_caption_eval.sh val1.json annotations/coco.json
File "myeval.py", line 24
print 'using %d/%d predictions' % (len(preds_filt), len(preds))
^
SyntaxError: invalid syntax
{
Bleu_1 : 0.742
ROUGE_L : 0.549
SPICE : 0.194
METEOR : 0.266
Bleu_4 : 0.332
Bleu_3 : 0.439
Bleu_2 : 0.58
CIDEr : 1.085
}
Maybe you can take a look again?
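Side note: the SyntaxError in the log above is just the Python 2 print statement on line 24 of myeval.py being rejected by Python 3. A minimal illustration of the fix, with dummy stand-ins for the two lists:

```python
# Dummy stand-ins for the prediction lists used in myeval.py.
preds = list(range(10))
preds_filt = list(range(8))

# Python 2 only (what the log shows failing):
#   print 'using %d/%d predictions' % (len(preds_filt), len(preds))
# Works under both Python 2 and Python 3:
print('using %d/%d predictions' % (len(preds_filt), len(preds)))
```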
@se4u I think the numbers without finetuning look good to me, with CIDEr around 0.9. When you finetune the CNN, please set the learning rate a little smaller (I think in my case I set it to 1e-4).
@jiasenlu Thanks a lot for the update, this clears up a lot of the issues:
assigned 113287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test
The above snippet shows that you are using 5k+5k for val and test and everything else for training. This certainly makes sense on its own. However, I was under the impression that the Karpathy splits
meant that only the 82k images in the original training set should be used for training. I got this impression from the LRCN code, specifically the following file on Jeff Donahue's recurrent branch that downloads a tar.gz
file splitting the dataset into 82k/5k/5k. Adding another 30k images to the training set should definitely boost the performance, so now I feel pretty confident about being able to reproduce the results.
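A quick sanity check of those numbers against the log above (the 82,783 figure is the size of the original MSCOCO train2014 set; everything else is read directly off the log):

```python
# Numbers from the DataLoader log above.
total_images   = 123287
train_assigned = 113287
val_assigned   = 5000
test_assigned  = 5000
assert train_assigned + val_assigned + test_assigned == total_images

# Original MSCOCO train2014 size (the "82k" referred to above).
coco_train2014 = 82783
extra_from_val2014 = train_assigned - coco_train2014
print(extra_from_val2014)   # 30504 -- the ~30k extra images folded into training
```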
The max sequence length in your log above is smaller than the one in the paper: 18 in the paper vs. 16 in the log. It's actually impressive that even after truncating the length of ~8% of the captions the performance is still so high. I guess this hyperparameter is not that important as long as it's set to a reasonable value, but knowing that you can set it low, so that training is a little faster, and still get a high score is useful.
@jiasenlu Finally, regarding finetuning the CNN, thanks for the tip. I did see that you were using two different learning rates for the LSTM part and the CNN: the CNN's learning rate is currently set to 1e-5, while the LSTM's learning rate is 4e-4 (5e-4 in the paper).
If I understand correctly, you are saying that while finetuning the CNN you keep its learning rate low, so that its parameters don't change too much, and you lower the LSTM's learning rate to 1e-4?
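Just to make sure I'm reading the setup right, here is a minimal PyTorch-style sketch of the two-learning-rate configuration as I understand it (the actual repo is Torch/Lua with separate optim states; the modules below are just stand-ins):

```python
import torch
import torch.nn as nn

# Stand-ins for the two sub-networks being trained.
cnn = nn.Linear(2048, 512)    # plays the role of the ResNet being finetuned
lstm = nn.LSTM(512, 512)      # plays the role of the language model

# One optimizer, two parameter groups with different learning rates.
optimizer = torch.optim.Adam([
    {'params': cnn.parameters(),  'lr': 1e-5},   # small lr so the CNN barely moves
    {'params': lstm.parameters(), 'lr': 4e-4},   # LSTM lr (5e-4 in the paper)
])
```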
> The above snippet shows that you are using 5k+5k for val and test and everything else for training. This certainly makes sense on its own. However, I was under the impression that the Karpathy splits meant that only the 82k images in the original training set should be used for training. I got this impression from the LRCN code, specifically the following file on Jeff Donahue's recurrent branch that downloads a tar.gz file splitting the dataset into 82k/5k/5k. Adding another 30k images to the training set should definitely boost the performance, so now I feel pretty confident about being able to reproduce the results.
I'm not sure what LRCN did, but this is based on the neuraltalk2 split. I think most recent image captioning papers are all based on that split. And the COCO challenge results can also show something, right?
@se4u As for the splits you have discussed, in my opinion the splits in this repo and in neuraltalk2 (a random split that mixes up images from train2014/ and val2014/) are different from Karpathy's split (https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip, whose val and test splits only contain images from the val2014/ folder).
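If it helps settle the split question, the split file in that zip (dataset_coco.json) can be inspected directly; a sketch, assuming the usual 'split' and 'filepath' fields in that file:

```python
import json
from collections import Counter

# dataset_coco.json comes from the caption_datasets.zip linked above.
data = json.load(open('dataset_coco.json'))

# How many images land in each split (train / restval / val / test).
print(Counter(img['split'] for img in data['images']))

# Which folder the val/test images come from (should be val2014/ only
# if this is Karpathy's split as described above).
print(Counter(img['filepath'] for img in data['images']
              if img['split'] in ('val', 'test')))
```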
But I still don't understand why the pre-trained model provided by the author can have such a high score. Maybe training from scratch can reproduce the result in the paper (CIDEr 1.085), so how was the pre-trained model trained?
My eval result:
model | split | CIDEr |
---|---|---|
model_id1_36.t7 | random split, test | 1.032 |
model_id1_34.t7 | random split, test | 1.235 |
model_id1_36.t7 | Karpathy's split, test | 1.237 |
model_id1_34.t7 | Karpathy's split, test | 1.219 |
All the models above were evaluated without finetuning.
Hi Jiasen,
The performance of your released model
(coco_train.t7, cocotalk_vocab.json)
seems to be much better than the performance reported in the highlighted row of Table 1 in your paper (screenshot attached). I feel that I must be misunderstanding something about the code/models. My understanding is that the following models were trained using the standard Karpathy splits of the MSCOCO captions, and that model (1) was used to generate the lowermost results in Table 1.
However, when I test the predictions of these models on the test portion of the Karpathy splits, all the metrics are much higher than the ones reported in Table 1 of the paper. Do you have any idea why the eval metrics might be so much better than those reported in the paper? I have also tested my own evaluation code by reproducing the results in the LRCN paper, so I am fairly sure that my evaluation code is correct.