qq283215389 opened this issue 5 years ago
That's very common in the first several epochs. Try training it a bit longer, or just restart the training.
OK, thanks a lot. What about the other VSE model (VSEAttModel) and the "pair loss", whose results aren't shown in your CVPR 2018 paper "Discriminability Objective for Training Descriptive Captions"?
Pair loss is worse, and VSEAttModel gives worse results too.
Thanks! If the retrieval model performs better (like the one in the paper "Stacked Cross Attention for Image-Text Matching"), can we get a better result for the captioning model?
I think it's very likely.
Hello, luo. These are my results from pre-training the retrieval model after running "run_fc_con.sh"; there is still a gap from the retrieval results reported in your paper.
Average i2t Recall: 53.9
Image to text: 29.9 59.2 72.6 4.0 19.6
Average t2i Recall: 42.3
Text to image: 20.6 46.5 59.8 7.0 40.8
Did you download my pretrained model? Does it perform better, i.e. match what's reported in the paper? https://drive.google.com/open?id=1oQ_O-O2KoSQv1xdBPKaIOGt-VW0gS-42 These are my training curves, to give you a hint.
I might have found the problem: I used a size of 7x7 for the COCO fc features. I think you used 14x14 for the COCO fc features?
The fc feature doesn't have spatial dimensions; it's a vector.
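For context, here is a minimal sketch of the difference between the spatial "att" features and the pooled "fc" feature vector. The backbone, input size, and variable names below are illustrative assumptions, not the repo's exact preprocessing script.

```python
import torch
import torchvision.models as models

# Illustrative only: how a spatial "att" feature map and a pooled "fc" vector
# typically relate for a ResNet backbone (not the repo's exact prepro script).
resnet = models.resnet101()
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc head

img = torch.randn(1, 3, 448, 448)        # a 448x448 input gives a 14x14 grid (stride 32)
att_feat = backbone(img)                 # (1, 2048, 14, 14) spatial "att" features
fc_feat = att_feat.mean(dim=(2, 3))      # (1, 2048) pooled "fc" feature: just a vector

print(att_feat.shape)  # torch.Size([1, 2048, 14, 14])
print(fc_feat.shape)   # torch.Size([1, 2048])
```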
I found that other papers use Karpathy's split for COCO, while your paper uses Rama's split. Are the test data the same? Why can you compare your results with the results in self-critical?
The splits are different. The self-critical number is from my implementation on Rama's split. We use Rama's split because we need to compare our results to Rama's.
Hello, luo. When I pretrain the VSEFCmodel, the vse_loss doesn't converge well; it stays around 51.2. Is there some mistake in my experiments? What was your vse_loss when you pretrained the VSEFCmodel?
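For reference, the vse_loss in VSE-style retrieval models is usually a sum-of-hinges contrastive loss over the image-caption similarity matrix, so its starting value scales with the margin and the batch size. The sketch below is a generic version of that loss; the margin value and the per-image normalization are illustrative assumptions, not necessarily identical to this repo's implementation.

```python
import torch
import torch.nn as nn

class ContrastiveLoss(nn.Module):
    """Generic VSE-style sum-of-hinges contrastive loss (illustrative, not the
    repo's exact code). With random embeddings, every off-diagonal hinge term is
    roughly equal to the margin, so the untrained loss sits near
    2 * (batch_size - 1) * margin per image and should drop as training converges."""

    def __init__(self, margin=0.2):
        super().__init__()
        self.margin = margin

    def forward(self, im, s):
        # im, s: (batch, dim) L2-normalized image / sentence embeddings
        scores = im @ s.t()                       # (batch, batch) cosine similarities
        pos = scores.diag().view(-1, 1)

        # hinge over negative captions (rows) and negative images (columns)
        cost_s = (self.margin + scores - pos).clamp(min=0)
        cost_im = (self.margin + scores - pos.t()).clamp(min=0)

        # zero out the positive pairs on the diagonal
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        cost_s = cost_s.masked_fill(mask, 0)
        cost_im = cost_im.masked_fill(mask, 0)

        # average per image so the value is comparable across batch sizes
        return (cost_s.sum() + cost_im.sum()) / scores.size(0)
```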