Closed: SkylerZheng closed this issue 6 years ago.
Are you using a pretrained model?
I don't think so, because I ran the scripts in the following order:
bash run_fc_con.sh
bash run_att.sh
bash run_att_d.sh 1 (1 is the discriminability weight; it can be changed to other values)
So I am using the model I just trained, right?
It looks to me like the dictionary used for evaluation doesn't match the one used for training. Can you verify that?
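A minimal way to check, assuming the infos pickles follow this repo's usual layout (a dict whose 'vocab' field maps word indices to words); the file paths below are placeholders for your actual runs:

# Minimal sketch, assuming the infos pickles store the vocabulary under
# a 'vocab' key (an index -> word dict); the paths are placeholders.
import pickle

with open('log_att/infos.pkl', 'rb') as f:      # infos saved by training
    train_infos = pickle.load(f)
with open('log_att_d1/infos.pkl', 'rb') as f:   # infos used at evaluation
    eval_infos = pickle.load(f)

train_vocab = train_infos['vocab']
eval_vocab = eval_infos['vocab']

if train_vocab == eval_vocab:
    print('Vocabularies match (%d words)' % len(train_vocab))
else:
    mismatched = [k for k in train_vocab if train_vocab[k] != eval_vocab.get(k)]
    print('%d mismatched indices, e.g.:' % len(mismatched))
    for k in mismatched[:5]:
        print(k, train_vocab[k], '!=', eval_vocab.get(k))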
Yeah, you're right. I saved my trained model in the log_att directory, but I tested with the wrong model under the log_att_d1 directory.
Thanks a lot. Now it's testing correctly.
I really appreciate your quick response.
Tokenization...
PTBTokenizer tokenized 99344 tokens at 721579.12 tokens per second.
PTBTokenizer tokenized 16786 tokens at 239467.86 tokens per second.
setting up scorers...
computing Bleu score...
{'reflen': 15132, 'guess': [15165, 13543, 11921, 10299], 'testlen': 15165, 'correct': [30, 0, 0, 0]}
ratio: 1.00218080888
Bleu_1: 0.002
Bleu_2: 0.000
Bleu_3: 0.000
Bleu_4: 0.000
computing METEOR score...
METEOR: 0.009
computing Rouge score...
ROUGE_L: 0.002
computing CIDEr score...
CIDEr: 0.001
computing SPICE score...
Parsing reference captions
Parsing test captions
SPICE evaluation took: 2.144 s
SPICE: 0.002
loss: {'loss': tensor(31.6388, device='cuda:0'), 'cap_xe': tensor(31.6419, device='cuda:0'), 'retrieval_loss_greedy': tensor(7.4241, device='cuda:0'), 'retrieval_sc_loss': tensor(1.00000e-03 * -3.1324, device='cuda:0'), 'loss_vse': tensor(0., device='cuda:0'), 'loss_cap': tensor(31.6419, device='cuda:0'), 'retrieval_loss': tensor(7.6047, device='cuda:0')}
{u'SPICE_Object': '0.006404463463649654', u'SPICE_Cardinality': '0.0', u'SPICE_Attribute': '0.0', 'CIDEr': '0.001079661462843171', u'SPICE_Size': '0.0', 'Bleu_4': 1.04439324421061e-15, 'Bleu_3': 2.3054219540753186e-14, 'Bleu_2': 1.208598304910465e-11, 'Bleu_1': 0.001978239366963272, u'SPICE_Color': '0.0', 'ROUGE_L': '0.001795472073475935', 'METEOR': 0.009059195566343728, u'SPICE_Relation': '0.0', 'SPICE': '0.0024048127567198488'}
Terminating BlobFetcher
Is this evaluating the image caption model? It looks like the retrieval model.

image 474190: woods conditioner china memorial scraper sash bringing woods interstate sunroof distant
image 277907: woods pairs china listed want listed bringing woods crowd
image 43033: woods hanging service woods peep dinosaurs cooking wonder
image 542103: woods conditioner china memorial gooey bringing cooking gain woody adorable
image 356116: woods majestically rice bringing cooking gain woody woods peep
image 538581: woods hanging service woods windsurfer dinosaurs cooking weeds woody woods windsurfer
image 359354: woods hanging effects woods silver dinosaurs woods silver
image 457146: woods captive honk bringing retrieve china woods holds
image 75305: woods majestically honk lots woods goofing woody woods silver
image 249968: woods troll honk bringing cooking fir china woods bubble foreheads
image 480451: woods hanging catchers woods tightly hollow bringing woods tightly hitting
image 379596: woods hangings china pouches want pouches bringing woods goofing
image 322362: woods patch benched honk bringing woods holds woody woods overgrowth gains
image 495233: woods conditioner china memorial honk bringing woods caddy musical woods overgrowth gains
image 366948: woods conditioner china lipstick rice dinosaurs woods mirrors
image 332833: woods burrito levels honk bringing cooking lock woody woods keypad
image 512346: woods hanging service woods draining buddhist dinosaurs woods peek
evaluating validation preformance... 2049/5000 (31.236956)
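For reference, captions like the ones above are exactly what a train/eval dictionary mismatch produces: the model emits sensible word indices, but they are looked up in the wrong index-to-word table. A toy illustration (both vocabularies here are made up):

# Toy illustration with made-up vocabularies: the same index sequence
# decodes cleanly with the matching vocab and to nonsense otherwise.
ids = [1, 2, 3, 4]
right_vocab = {1: 'a', 2: 'man', 3: 'riding', 4: 'a horse'}
wrong_vocab = {1: 'woods', 2: 'conditioner', 3: 'china', 4: 'memorial'}

print(' '.join(right_vocab[i] for i in ids))  # a man riding a horse
print(' '.join(wrong_vocab[i] for i in ids))  # woods conditioner china memorial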