Open spk921 opened 8 years ago
You definitely need to fine-tune the CNN. Without fine-tuning you are just training a language model.
Note that there is an embedding layer added on top of the CNN. If you don't fine-tune, the embedding layer will output near-zero vectors, which effectively suppresses the signal from the image.
If you want to fine-tune, just follow the instructions.
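For reference, enabling fine-tuning roughly looks like this on the command line (the flag names here are from memory, so please double-check them against train.lua):

    th train.lua -start_from /path/to/your_checkpoint.t7 -finetune_cnn_after 0

The usual recipe is to train with the CNN frozen first, then restart from the saved checkpoint with CNN fine-tuning switched on.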
Thank you. Yes, a few days ago I did find the fine-tune flag. However, according to the README the model should reach about 0.7 CIDEr without fine-tuning and then about 0.9 after fine-tuning. Also, my CIDEr score doesn't increase after turning the fine-tune flag on. Did you get a good CIDEr score? Are the default parameter settings good to use?
Sorry, you are right, the CIDEr should be better after so many iterations according to the instructions. I was using this model for other tasks, so I didn't really look at the CIDEr. My guess is there's something wrong in the evaluation code, but I'm not sure. You can print the images and captions out and see if they make sense.
The CIDEr should go higher. (Now I am confused about why a random embedding of the image feature can work.) I got 0.7 after 40000 iterations.
So did you use the default settings? Without fine-tuning you got CIDEr 0.7 with the default settings and the Adam optimizer? And that was after 40000 iterations without the fine-tune flag, or with it? Would you mind giving me more details? In this code the embedding is an nn.LookupTable, and its randomly initialized weights are updated during training. Also, the Google Show and Tell paper says that pre-trained embeddings do not help performance. I think the embedding's weights are trained over the iterations.
All defaults and no fine-tuning. I was thinking of the image embedding: it's the linear layer after the VGG, which is not updated before fine-tuning.
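To make that concrete, the image embedding I mean is roughly the following (a simplified sketch in the spirit of net_utils, not the exact repo code; 4096 is the usual VGG fc7 size and 512 the default encoding size):

    -- simplified sketch: truncated VGG followed by the learned image embedding
    require 'nn'
    local encoding_size = 512
    local cnn_part = nn.Sequential()
    -- ... the truncated VGG layers up to fc7 go here ...
    cnn_part:add(nn.Linear(4096, encoding_size))  -- the image embedding layer
    cnn_part:add(nn.ReLU(true))

While CNN fine-tuning is off, neither the VGG layers nor this linear layer receives gradient updates, which is why the image signal stays weak.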
@spk921 Hi, how is your situation going? It seems the same thing is happening to me. I trained the model with the default settings and training has gone through ~35000 iterations. Though I got roughly 0.65 CIDEr, the model just generates the same caption for any image I give it: "a man in a suit and tie standing next to a woman"
@spk921 @cuongduc @ruotianluo
Sorry to bother you, but I am very anxious, and I really want to get the BLEU/METEOR/CIDEr scores when running eval.lua.
But when I use the command:
th eval.lua -model /home/yh/checkpoint/model_id1-501-1448236541.t7 -image_folder /home/yh/mscoco/test2015 -num_images -1 -language_eval 1
I get the following error message:
loading annotations into memory...
0:00:00.634204
creating index...
index created!
using 0/81434 predictions
Loading and preparing results...
Traceback (most recent call last):
File "myeval.py", line 29, in
Did I do something wrong? I really need your help!
You actually can't evaluate on the test data, because you have no ground truth for it. If you want to get the captions, turn off language eval.
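For example, something like this should write out captions without trying to score them (same flags as your command, just with language eval disabled):

    th eval.lua -model /home/yh/checkpoint/model_id1-501-1448236541.t7 -image_folder /home/yh/mscoco/test2015 -num_images -1 -language_eval 0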
@ruotianluo
I really appreciate your reply!
Actually, I can already get the image captions; I just want to get the BLEU/METEOR/CIDEr scores to evaluate the model I trained. I also tried to evaluate on the val data, but that didn't work either.
Can you tell me how to do it? How can I get the scores?
@y657250418 Could you get scores during training? If you could, you should be able to get results from eval.lua on the val set. Your previous error is probably because you used the test set; I can't tell what is really going on unless you provide the error output for the val set.
@ruotianluo Thanks a lot, but something is wrong with my remote server and I still can't get the scores. I have sent an email to your GitHub email address. Looking forward to your reply.
@ruotianluo @spk921 @cuongduc My remote server was repaired today, but when I try to evaluate the model on the val set it displays the same error. My command is
th eval.lua -model /home/yh/checkpoint/model_id1-501-1448236541.t7 -image_folder /home/yh/neuraltalk2/neuraltalk2-master/coco/images/val2014 -num_images -1 -language_eval 1
the error message:
cp "/home/yh/neuraltalk2/neuraltalk2-master/coco/images/val2014/COCO_val2014_000000198075.jpg" vis/imgs/img40504.jpg
image 40504: a group of people riding horses on a beach
evaluating performance... 0/-1 (0.000000)
loading annotations into memory...
0:00:01.178988
creating index...
index created!
using 0/40504 predictions
Loading and preparing results...
Traceback (most recent call last):
File "myeval.py", line 29, in
I really need your help!
Try some other value for num_images, like 1. (I think the original code has a problem with -1.)
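For example, keeping everything else the same:

    th eval.lua -model /home/yh/checkpoint/model_id1-501-1448236541.t7 -image_folder /home/yh/neuraltalk2/neuraltalk2-master/coco/images/val2014 -num_images 1 -language_eval 1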
@ruotianluo I find that whatever image I try to evaluate, the script copies the image and gives it a new image_id. So the image_id in the valevalscript.json created by net_utils.language_eval(predictions, id) is different from the id in captions_val2014.json, and the image can't be matched with its ground truth. It always shows
creating index... index created! using 0/40504 predictions
Did you have this problem? what can I do?
@y657250418 The problem is not different ids; it's that you didn't predict any captions: using 0/40504 predictions.
@ruotianluo I still think the problem is the different ids. When I evaluate images from an image_folder, the program uses DataLoaderRaw.lua to load the data instead of DataLoader.lua, and it gives each picture a new id. So when it tries to find the picture's ground truth in captions_val2014.json, there are zero matches. When I use a command that does not include the -image_folder parameter, I do get the model scores.
May I ask you another question about neuraltalk2? I want to use word2vec to generate word vectors and use them to represent the words in the captions. I have already generated the vectors, but I don't know where in the code to plug them in. I am very confused. Can you tell me? Thanks a lot, I really need your help.
@y657250418

    local info_struct = {}
    info_struct.id = self.ids[ri]
    info_struct.file_path = self.files[ri]
    table.insert(infos, info_struct)
The way the ground truth is found is via the id here, and this id comes from the annotation file.
If you want to use word2vec, you could manually initialize protos.lm.lookup_table.weight with the word2vec vectors. Although it's 'dirty', I think it's the most convenient way.
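As a rough sketch of what I mean (w2v here is a hypothetical tensor you would build yourself from your word2vec vectors, with the same shape as the lookup table weight and its rows ordered to match the loader's vocabulary; this is not code from the repo):

    -- somewhere in train.lua, after protos.lm has been created and before training starts
    -- w2v: tensor with the same shape as protos.lm.lookup_table.weight
    protos.lm.lookup_table.weight:copy(w2v)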
@ruotianluo Can you tell me more details about how to use word2vec here? Is protos.lm.lookup_table.weight a parameter? Is it one-dimensional, or is it like all the caption sequence vectors generated by word2vec? I have only used word2vec to generate the vectors, and I don't know how to use protos.lm.lookup_table either.
@y657250418 The weight of LookupTable is an n x d matrix, where n is the size of the dictionary and d is the dimension of each word vector. You could read the source code of LookupTable; it's quite straightforward.
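A toy snippet just to show the shape (not from the repo):

    require 'nn'
    local lt = nn.LookupTable(5, 3)   -- dictionary of 5 entries, 3-dim vectors
    print(lt.weight:size())           -- 5 x 3: one row per dictionary entry

Row i of lt.weight is the vector the layer returns when it sees token index i.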
@ruotianluo You mean that I just need to use an n x d matrix to initialize protos.lm.lookup_table.weight in LanguageModel.lua, in function layer:__init(opt)?
As an example, say the dictionary contains 3 words and their vectors are [1,1,1], [2,2,2], [3,3,3].
So I just need to use
self.lookup_table.weight = [[1,1,1],[2,2,2],[3,3,3]]
after
self.lookup_table = nn.LookupTable(self.vocab_size + 1, self.input_encoding_size) ?
Sorry, it seems stupid, but I just want to make sure.
Oh, I also want to ask you about the START and END tokens: does the (+1) mean they should be included in the matrix?
What are the START and END tokens in the program?
@y657250418 Yes. Index self.vocab_size + 1 is the start token in the LookupTable, and the end token in the decoding layer (maybe the wrong term). So in the output of the LSTM, index self.vocab_size + 1 means the end token.
@ruotianluo Sorry, I seem to be a little bit lost. Just as an example, say vocab_size is 3. So LookupTable[self.vocab_size + 1] means the vector corresponding to '3'. Right?
@y657250418 I can't understand your example.
My point is that it's different for encoding and decoding.
For example, vocab_size is three, and the words are ['a', 'b', 'c'].
Then we will encode 'a' as 1, 'b' as 2 and 'c' as 3; index 4 (vocab_size + 1) is the start token on the encoding side and the end token on the decoding side.
@ruotianluo Oh, you mean that I don't need to consider the start token and the end token when initializing, and I just need to build the n x d matrix, where n is the size of the dictionary and d is the dimension of each word vector. I know how to do it. Thanks a lot, you are really kind.
@y657250418 You actually do need to consider the start token during initialization: n = vocab_size + 1 in my notation. I guess you could just randomly initialize the vector for the start token.
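Putting the pieces together, a minimal sketch might look like this (word2vec_vectors is a hypothetical vocab_size x input_encoding_size tensor you have prepared, with rows ordered to match the loader's vocabulary; this is not code from the repo):

    -- build the full (vocab_size + 1) x d initialization matrix
    local d = protos.lm.input_encoding_size           -- assuming the LM keeps this field
    local n = protos.lm.vocab_size + 1                -- +1 row for the start token
    local init = torch.Tensor(n, d)
    init:narrow(1, 1, n - 1):copy(word2vec_vectors)   -- rows 1..vocab_size from word2vec
    init[n]:normal(0, 0.1)                            -- small random vector for the start token
    protos.lm.lookup_table.weight:copy(init)

The 0.1 standard deviation is arbitrary; any small random initialization for the start-token row should be fine.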
@ruotianluo In that case, the dimension of the vector is the same as self.input_encoding_size?
@y657250418 yes
Thank you for sharing this wonderful code. I did exactly the same thing with the default settings; the only difference is the batch size (16 -> 32).
LM parameters: rnn size: 512, input encoding size: 512, batch size: 32, drop prob: 0.5, seq per img: 5, optimizer: adam, learning rate: 4e-4, learning rate decay every: 50000, optim alpha: 0.8, optim beta: 0.999, optim epsilon: 1e-8.
CNN parameters: optim alpha: 0.8, optim beta: 0.999, learning rate: 1e-5, weight decay: 0.
Results:
Iteration 221250: CIDEr 0.275
Iteration 225000: CIDEr 0.228
Iteration 228750: CIDEr 0.262
Iteration 232500: CIDEr 0.245
Iteration 236250: CIDEr 0.24
Iteration 240000: CIDEr 0.268
Iteration 243750: CIDEr 0.216
Iteration 247500: CIDEr 0.237
Iteration 251250: CIDEr 0.248
Iteration 255000: CIDEr 0.317
Iteration 258750: CIDEr 0.312
Iteration 262500: CIDEr 0.304
Iteration 266250: CIDEr 0.295
Iteration 270000: CIDEr 0.248
Iteration 273750: CIDEr 0.292
Iteration 277500: CIDEr 0.286
I followed all the instructions and did exactly the same thing. Am I the only one getting a bad score? What do I have to do to get a CIDEr score near 0.7?
Plus: how can I fine-tune the CNN? What does fine-tuning the CNN mean? Could someone give me advice, with examples?