karpathy / neuraltalk2

Efficient Image Captioning code in Torch, runs on GPU

Am I the only one getting a low CIDEr score with the default settings? I barely got 0.3~0.4 after 277500 iterations. And what is CNN fine-tuning? #64

Open spk921 opened 8 years ago

spk921 commented 8 years ago

Thank you for sharing this wonderful code. I did exactly the same thing with the default settings; the only difference is the batch size (16 -> 32).

LM parameters: rnn size: 512, input encoding size: 512, batch size: 32, drop prob: 0.5, seq per img: 5, optimizer: adam, learning rate: 4e-4, learning rate decay every: 50000, optim alpha: 0.8, optim beta: 0.999, optim epsilon: 1e-8
CNN parameters: optim alpha: 0.8, optim beta: 0.999, learning rate: 1e-5, weight decay: 0

Results:

Iteration 221250: CIDEr 0.275
Iteration 225000: CIDEr 0.228
Iteration 228750: CIDEr 0.262
Iteration 232500: CIDEr 0.245
Iteration 236250: CIDEr 0.24
Iteration 240000: CIDEr 0.268
Iteration 243750: CIDEr 0.216
Iteration 247500: CIDEr 0.237
Iteration 251250: CIDEr 0.248
Iteration 255000: CIDEr 0.317
Iteration 258750: CIDEr 0.312
Iteration 262500: CIDEr 0.304
Iteration 266250: CIDEr 0.295
Iteration 270000: CIDEr 0.248
Iteration 273750: CIDEr 0.292
Iteration 277500: CIDEr 0.286

I followed all the instructions and did exactly the same thing. Am I the only one getting a bad score? What do I have to do to get a CIDEr score near 0.7?

Plus: how can I tune the CNN? What does CNN tuning mean? Could someone give me advice, with examples?

ruotianluo commented 8 years ago

You definitely need to fine-tune the CNN. Without fine-tuning you are just training a language model.

Note that there is an embedding layer added on top of the CNN. If you don't fine-tune, the embedding layer will output near-zero vectors, which suppresses the signal from the image.

If you want to fine-tune, just follow the instructions.
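Concretely, the README's recipe comes down to one flag on train.lua; a rough sketch of the command (paths are placeholders, and I'm quoting -finetune_cnn_after and -cnn_learning_rate from memory):

th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json -start_from model_id.t7 -finetune_cnn_after 0 -cnn_learning_rate 1e-5

-finetune_cnn_after 0 starts backpropagating into the CNN right away; with the default of -1 the CNN is never fine-tuned.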

spk921 commented 8 years ago

Thank you. Yes, a few days ago I did find the fine-tune flag. However, according to the README it should reach about CIDEr 0.7 without fine-tuning, and then reach about 0.9 after fine-tuning. Also, my CIDEr score hasn't increased since turning the fine-tune flag on. Did you get a good CIDEr score? Are the default parameter settings good to use?

ruotianluo commented 8 years ago

Sorry, you are right; the CIDEr should be better after that many iterations, according to the instructions. I was using this model for other tasks, so I didn't really look at the CIDEr. My guess is that there's something wrong in the evaluation code, but I'm not sure. You could print the images and captions out to see if they make sense.

ruotianluo commented 8 years ago

The CIDEr should go higher. (Now I am confused about why a random embedding of the image features can work at all.) I got 0.7 after 40000 iterations.

spk921 commented 8 years ago

So did you use the default settings? You got CIDEr 0.7 with the default settings and the adam optimizer, without fine-tuning? And after 40000 iterations: was that with or without the fine-tune flag? Would you mind giving me more detail? In this code the embedding is an nn.LookupTable, and the randomly initialized nn.LookupTable weights are updated during training. Also, the Google Show and Tell paper said that a pre-trained embedding will not help performance. I think the embedding weights are trained over the iterations.

ruotianluo commented 8 years ago

All default and no fine-tuning. I was thinking of the image embedding: it's the linear layer after the VGG, which is not updated before fine-tuning.
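For concreteness, the layer in question is roughly what net_utils.build_cnn appends to the truncated VGG; a sketch from memory, so treat the exact names and sizes as approximate (4096 is VGG's fc7 width, encoding_size is the LM input size, e.g. 512):

    -- the image embedding: a linear projection appended after VGG's fc7
    cnn_part:add(nn.Linear(4096, encoding_size))
    cnn_part:add(nn.ReLU(true))

Without CNN fine-tuning enabled, no gradients flow into this part, so the projection keeps its random initialization.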

cuongduc commented 8 years ago

@spk921 Hi, how is your situation going? It seems the same thing has happened to me. I trained the model with the default settings and training has gone through ~35000 iterations. Though I got roughly 0.65 CIDEr, the model just generates the same caption for every image I give it: "a man in a suit and tie standing next to a woman".

y657250418 commented 8 years ago

@spk921 @cuongduc @ruotianluo

Sorry to bother you, but I am very anxious: I really want to get the BLEU/METEOR/CIDEr scores when running eval.lua.

But when I used the command:

th eval.lua -model /home/yh/checkpoint/model_id1-501-1448236541.t7 -image_folder /home/yh/mscoco/test2015 -num_images -1 -language_eval 1

I get the following error message:

loading annotations into memory...
0:00:00.634204
creating index...
index created!
using 0/81434 predictions
Loading and preparing results...
Traceback (most recent call last):
  File "myeval.py", line 29, in <module>
    cocoRes = coco.loadRes(resFile)
  File "/mnt/disk1/yh/neuraltalk2/neuraltalk2-master/coco-caption/pycocotools/coco.py", line 280, in loadRes
    if 'caption' in anns[0]:
IndexError: list index out of range
/home/wonglab/torch/install/bin/luajit: ./misc/utils.lua:17: attempt to index local 'file' (a nil value)
stack traceback:
  ./misc/utils.lua:17: in function 'read_json'
  ./misc/net_utils.lua:202: in function 'language_eval'
  eval.lua:167: in function 'eval_split'
  eval.lua:173: in main chunk
  [C]: in function 'dofile'
  ...glab/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
  [C]: at 0x00406670

Did I do something wrong? I really need your help!

ruotianluo commented 8 years ago

You actually can't evaluate on the test data, because you have no ground truth for it. If you just want the captions, turn off language evaluation.
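For example, your command with language evaluation disabled will still generate and save the captions without trying to score them:

th eval.lua -model /home/yh/checkpoint/model_id1-501-1448236541.t7 -image_folder /home/yh/mscoco/test2015 -num_images -1 -language_eval 0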

y657250418 commented 8 years ago

@ruotianluo

I really appreciate your reply!

Actually, I can already get the image captions; I just want the BLEU/METEOR/CIDEr scores to evaluate the model I trained. I also tried to evaluate on the val data, but that didn't work either.

Can you tell me how to do it? How can I get the scores?

ruotianluo commented 8 years ago

@y657250418 Could you get scores during training? If you could, you should be able to get results from eval.lua as well. Your previous error was probably due to the test set; I can't tell what is really going on unless you provide the error info for the val set.

y657250418 commented 8 years ago

@ruotianluo Thanks a lot, but there is something wrong with my remote server and I still can't get the scores. I have sent an email to your GitHub email address. Looking forward to your reply.

y657250418 commented 8 years ago

@ruotianluo @spk921 @cuongduc My remote server was repaired today, but when I try to evaluate the model on the val set it displays the same error. My command is:

th eval.lua -model /home/yh/checkpoint/model_id1-501-1448236541.t7 -image_folder /home/yh/neuraltalk2/neuraltalk2-master/coco/images/val2014 -num_images -1 -language_eval 1

The error message:

cp "/home/yh/neuraltalk2/neuraltalk2-master/coco/images/val2014/COCO_val2014_000000198075.jpg" vis/imgs/img40504.jpg image 40504: a group of people riding horses on a beach evaluating performance... 0/-1 (0.000000) loading annotations into memory... 0:00:01.178988 creating index... index created! using 0/40504 predictions Loading and preparing results...
Traceback (most recent call last): File "myeval.py", line 29, in cocoRes = coco.loadRes(resFile) File "/mnt/disk1/yh/neuraltalk2/neuraltalk2-master/coco-caption/pycocotools/coco.py", line 280, in loadRes if 'caption' in anns[0]: IndexError: list index out of range /home/wonglab/torch/install/bin/luajit: ./misc/utils.lua:17: attempt to index local 'file' (a nil value) stack traceback: ./misc/utils.lua:17: in function 'read_json' ./misc/net_utils.lua:202: in function 'language_eval' eval.lua:167: in function 'eval_split' eval.lua:173: in main chunk [C]: in function 'dofile' ...glab/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk [C]: at 0x00406670

I really need your help!

ruotianluo commented 8 years ago

Try some other value for num_images, like 1. (I think the original code has a problem with -1.)
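i.e. something like:

th eval.lua -model /home/yh/checkpoint/model_id1-501-1448236541.t7 -image_folder /home/yh/neuraltalk2/neuraltalk2-master/coco/images/val2014 -num_images 1 -language_eval 1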

y657250418 commented 8 years ago

@ruotianluo I find that whatever image I try to evaluate, the program copies it to a new file and renames the image_id. So the image_id in the valevalscript.json created by net_utils.language_eval(predictions, id) is different from the id in captions_val2014.json, and the image can't be matched to its ground truth. It always shows:

creating index...
index created!
using 0/40504 predictions

Did you have this problem? What can I do?

ruotianluo commented 8 years ago

@y657250418 The problem is not a different id; it's that you didn't predict any captions: "using 0/40504 predictions".

y657250418 commented 8 years ago

@ruotianluo I still think the problem is the different id. When I evaluate images from an image_folder, the program uses DataLoaderRaw.lua to load the data instead of DataLoader.lua, and it gives each picture a new name. So when it tries to find the ground truth for a picture in captions_val2014.json, the match count is 0. When I use a command that does not include the -image_folder parameter, I do get the model scores.

May I ask you another question about neuraltalk2? I want to use word2vec to generate word vectors and use those vectors to represent the words in the captions. I have already generated the vectors, but I don't know where in the code to plug them in. I am very confused. Can you tell me? Thanks a lot. I really need your help.

ruotianluo commented 8 years ago

@y657250418

    local info_struct = {}
    info_struct.id = self.ids[ri]
    info_struct.file_path = self.files[ri]
    table.insert(infos, info_struct)

The ground truth is looked up using the id here, and this id comes from the annotation file.

If you want to use word2vec, you could manually initialize protos.lm.lookup_table.weight with the word2vec vectors. Although it's "dirty", I think it's the most convenient way.
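For example, somewhere after protos is constructed in train.lua, something along these lines (a rough sketch; w2v here is a hypothetical vocab_size x input_encoding_size tensor that you build yourself, with row i holding the word2vec vector for the word with index i in the model's vocab):

    -- hypothetical: w2v is a vocab_size x input_encoding_size torch.Tensor
    -- built from your word2vec vectors, ordered by the model's word indices
    local lut = protos.lm.lookup_table
    -- copy word vectors into rows 1..vocab_size; row vocab_size+1 is the
    -- start token, which has no word2vec vector, so leave it random
    lut.weight:narrow(1, 1, vocab_size):copy(w2v)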

y657250418 commented 8 years ago

@ruotianluo Can you give me more details about how to use word2vec here? Is protos.lm.lookup_table.weight a single parameter? One-dimensional? Or is it like all the caption sequence vectors generated by word2vec? I have only used word2vec to generate the vectors, and I don't know how to use protos.lm.lookup_table either.

ruotianluo commented 8 years ago

@y657250418 The weight of a LookupTable is an n x d matrix, where n is the size of the dictionary and d is the dimension of each word vector. You could read the source code of LookupTable; it's quite straightforward.

y657250418 commented 8 years ago

@ruotianluo You mean that I just need to use an n x d matrix to initialize protos.lm.lookup_table.weight in LanguageModel.lua's function layer:__init(opt)? As an example, say the dictionary contains 3 words and their vectors are [1,1,1], [2,2,2], [3,3,3]. So I just need to set self.lookup_table.weight = [[1,1,1],[2,2,2],[3,3,3]] after self.lookup_table = nn.LookupTable(self.vocab_size + 1, self.input_encoding_size)? Sorry if it seems stupid, but I just want to make sure. Oh, I also want to ask about the START and END tokens: does the +1 mean they should be included in the matrix? What are the START and END tokens in the program?

ruotianluo commented 8 years ago

@y657250418 Yes. self.vocab_size + 1 is the start token in the lookup table, and the end token in the decoding layer (maybe the wrong term). So in the output of the LSTM, the index self.vocab_size + 1 means the end token.

y657250418 commented 8 years ago

@ruotianluo Sorry, I seem to be a little bit lost. Just as an example, say vocab_size is 3. So LookupTable[self.vocab_size + 1] means the vector corresponding to '3'. Right?

ruotianluo commented 8 years ago

@y657250418 I can't understand your example. My point is that encoding and decoding are different. For example, vocab_size is three and the words are ['a', 'b', 'c']. Then we will encode '<start> a b c' to 4, 1, 2, 3. However, when we get the sequence 1, 2, 3, 4 from the language model, it actually means 'a b c <end>'. So there are two dictionaries, one for encoding and one for decoding, and they are the same except for the word at position vocab_size + 1.
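In code terms, a toy illustration of the two mappings (hypothetical Lua tables, not actual neuraltalk2 structures):

    -- vocab_size = 3, so index 4 = vocab_size + 1 plays two roles
    local encode_ix   = { ['<start>'] = 4, a = 1, b = 2, c = 3 }           -- input side
    local decode_word = { [1] = 'a', [2] = 'b', [3] = 'c', [4] = '<end>' } -- output side
    -- encoding '<start> a b c' -> 4, 1, 2, 3
    -- decoding 1, 2, 3, 4      -> 'a b c <end>'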

y657250418 commented 8 years ago

@ruotianluo Oh, you mean that I don't need to consider the start and end tokens when initializing, and I just need to build the n x d matrix, where n is the size of the dictionary and d is the dimension of each word vector. I know how to do it now. Thanks a lot; you are really kind.

ruotianluo commented 8 years ago

@y657250418 You actually do need to consider the start token during initialization: n = vocab_size + 1 in my notation. I guess you could just randomly initialize the vector for the start token.

y657250418 commented 8 years ago

@ruotianluo In that case, is the dimension of the vectors the same as self.input_encoding_size?

ruotianluo commented 8 years ago

@y657250418 yes