intuinno opened this issue 8 years ago
Hi @intuinno, what hyperparameters did you pick? I tried to run the code on the COCO data; the algorithm terminated after 15 epochs and achieved the following scores, which are much lower than yours.
Bleu 1: .527 Bleu 2: .333 Bleu 3: .210 Bleu 4: .138 METEOR: .163 ROUGE_L: .403 CIDEr: .371
Hi @intuinno, @snakeztc, I am running this code on flickr8k, just to test it. I have run it and got the visualization, but I don't know how to get the scores (BLEU and METEOR). Could you tell me which script produces them? Please forgive me if I'm bothering you.
Hi @intuinno, @snakeztc, @yaxingwang. My flickr8k scores with the parameters in capgen.train() are BLEU = 0.504 / 0.270 / 0.145 / 0.082. My best scores are BLEU = 0.550 / 0.296 / 0.164 / 0.095, obtained with the parameters in eval_coco plus optimizer = rmsprop. My scores are lower than the paper's: BLEU = 0.670 / 0.457 / 0.314 / 0.213.
@yaxingwang metrics.py, or I am using neuraltalk's script (see my repository).
Hi @AAmmy, I also get results similar to yours, but the BLEU-1 you got is better than mine (0.30). Did you do any normalization of the dataset? After normalizing, I get worse results. I use metrics.py to compute the scores.
@yaxingwang I did not normalize the dataset. My preprocessing:
token and the train/valid/test splits are the same as the files in Flickr8k_text.zip
Yes, mine is the same. I tried normalization because the results were poor.
For me:
1. Resize so the shorter side is 256:
   if width > height:
       width = (width * resize) / height  # resize = 256
       height = resize
   else:
       height = (height * resize) / width
       width = resize
2. Center crop images to 224x224.
3. Extract features from the VGG conv5_4 layer.
(A runnable sketch of steps 1-2 follows below.)
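Here is a minimal sketch of steps 1-2, assuming Pillow (the function name is mine; the VGG feature extraction in step 3 is not shown):

from PIL import Image

def resize_and_center_crop(path, resize=256, crop=224):
    # Resize so the shorter side equals `resize`, keeping the aspect ratio.
    im = Image.open(path).convert('RGB')
    width, height = im.size
    if width > height:
        width, height = (width * resize) // height, resize
    else:
        width, height = resize, (height * resize) // width
    im = im.resize((width, height), Image.BILINEAR)
    # Center crop to crop x crop.
    left = (width - crop) // 2
    top = (height - crop) // 2
    return im.crop((left, top, left + crop, top + crop))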
I am confused about whether all parameters should be the same for the three datasets. I just ran the code offered by @intuinno, and I found the parameters for the three datasets are the same. Also, what was the epoch value when the script stopped? I got epoch = 79, and I don't know whether it is over-fitting.
@yaxingwang I think the parameters in intuinno's evaluate_flickr8k.py are actually for coco and flickr30k; the parameters for flickr8k and for flickr30k/coco are not the same (Section 5.2 in the paper).
I think the parameters in the original capgen.py file are for flickr8k. (I used these; training stopped around epoch 70.)
I also ran flickr8k training with the parameters for coco (same as intuinno's evaluate_flickr8k.py, so the same as yours?); the scores were BLEU = 0.493 / 0.258 / 0.130 / 0.072, early stop at epoch 89 (6-12 hours).
I also changed patience and some other parameters to check for over-fitting: after epoch 89, samples from the validation set seemed to get better, but the BLEU score (on test) got worse.
@AAmmy, thank you. I am trying to run both flickr30k and coco, but I guess the memory of my computer is too small to process flickr30k, so I am still working on it. Did you run into this problem with flickr30k? It reports MemoryError.
When using epoch = 10 or 20, the results are worse than at the epoch where the script early-stopped. Still, I think the epoch chosen by early stopping is not necessarily the best, since it does not have a strong relation with the scores. Maybe testing different epochs is optimal.
@yaxingwang I have the same memory problem on coco, and the sparse-to-dense conversion is too slow, so I extracted the features into one file per image.
I changed the code and data format as below.
caption example:
train_cap = [['a dog running', 'OOO.jpg'], ['dogs running', 'OOO.jpg'],
             ..., ['a cat running', '+++.jpg'], ['cats running', '+++.jpg']]
('a dog running' and 'dogs running' are captions for OOO.jpg; OOO.jpg is the image file name, and OOO.jpg.mat will be the feature file extracted from OOO.jpg.)
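As an aside, a rough sketch of how such a caption list could be built for Flickr8k (the token and split file names are the standard ones from Flickr8k_text.zip; the output pickle name is just an example):

import pickle as pkl

# Flickr8k.token.txt lines look like: "1000268201_693b08cb0e.jpg#0<TAB>A child in a pink dress ..."
train_images = set(open('Flickr_8k.trainImages.txt').read().split())
train_cap = []
with open('Flickr8k.token.txt') as f:
    for line in f:
        img_id, caption = line.rstrip('\n').split('\t')
        img_name = img_id.split('#')[0]   # drop the '#0'..'#4' caption index
        if img_name in train_images:
            train_cap.append([caption, img_name])

with open('flicker_8k_cap.train.pkl', 'wb') as f:
    pkl.dump(train_cap, f)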
In flickr.py or coco.py:

In prepare_data():
    # load the target feature file for each caption (needs: from scipy.io import loadmat)
    for cc in caps:
        seqs.append([worddict[w] if worddict[w] < n_words else 1 for w in cc[0].split()])
        feat_list.append(loadmat(feat_path + str(cc[1]) + '.mat')['feats'])  # my code
        # feat_list.append(features[cc[1]])  # original code

    # OOO.jpg.mat already holds a dense matrix, so no need for todense()
    # y = numpy.zeros((len(feat_list), feat_list[0].shape[1])).astype('float32')  # original code
    # for idx, ff in enumerate(feat_list):  # original code
    #     y[idx,:] = numpy.array(ff.todense())  # original code
    # y = y.reshape([y.shape[0], 14*14, 512])  # original code
    y = numpy.array(feat_list).reshape([len(feat_list), 14*14, 512]).astype('float32')  # my code

In load_data():
    # only the caption files are loaded; features come from the per-image .mat files
    train_cap = pkl.load(open(path + 'flicker_30k_cap.train.pkl', 'rb'))
    train_feat = []
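For completeness, a hypothetical companion snippet for producing those per-image .mat files (only the 'feats' key and the dense 14*14*512 layout are implied by the loading code above; the rest is my own sketch):

import numpy
from scipy.io import savemat

def save_image_feature(feat_path, image_name, conv5_4):
    # conv5_4: VGG conv5_4 feature map for one image, shape (14, 14, 512)
    flat = numpy.asarray(conv5_4, dtype='float32').reshape(1, 14 * 14 * 512)
    # Saved as e.g. OOO.jpg.mat so that loadmat(feat_path + image_name + '.mat')['feats'] works
    savemat(feat_path + image_name + '.mat', {'feats': flat})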
Hmm... I will try on different epochs.
@AAmmy , Thanks.
I also created my own scripts to prepare the data. I completely skipped the sparse matrix stuff since I think it's not needed at all. I have a single HDF5 file with CONV5_4 features from the VGG19 network for Flickr30k (around 12GB). This file contains all the image features for all splits in the following order: train, valid and test. The order of the jpeg files, for matching the order of the feature matrix, is also available.
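For anyone following the same route, a minimal sketch of reading such a single HDF5 file with h5py (the file and dataset names are placeholders, not necessarily the ones I used):

import h5py
import numpy

with h5py.File('flickr30k_conv5_4.h5', 'r') as f:
    feats = f['feats']             # e.g. shape (n_images, 14*14, 512), ordered train/valid/test
    first = numpy.array(feats[0])  # annotation vectors for the first training image
    print(feats.shape, first.shape)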
I am pretty sure I am not making any mistake (though apparently I am, since you at least got some results), but all I get is repetitive phrases of meaningless words, with a BLEU of 0 and a validation loss that doesn't improve at all.
I create the dictionary in a frequency-ordered fashion; 0 is the end-of-sequence token and 1 is UNK.
I don't know where the problem is at all.
@intuinno, your results are the closest to the reported COCO results; which hyper-parameters did you use?
@kelvinxu, @kyunghyuncho, the paper does not mention the hyper-parameters for the different datasets. Would you mind providing this information? (Plus maybe even the models themselves, which should not be too big for a Dropbox/GDrive file.)
Hi everybody,
I'd like to share my observations and experiments with the code on the Flickr30k dataset:
Preprocessing:
Feature dimensions: y.reshape([y.shape[0], 14*14, 512]) was not correct for my feature file and I was obtaining complete nonsense during training. Ensure that the reshaping is done correctly.
Early stopping with BLEU:
This seems critical and it's mentioned in the paper as well, but unfortunately it is not implemented in the code. The validation loss is not correlated with BLEU or METEOR. I just save the model into a temporary file before each validation and call generate_caps.py to save the hypotheses into a file. I then use the pycocoevalcap utilities to obtain BLEU-1 to BLEU-4 and METEOR scores. After that you can select on which metric you would like to early stop.
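Roughly, the hook looks like this (helper names, file names and the generate_caps.py invocation are only illustrative, not the actual code):

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor

def score_hypotheses(hyp_file, references):
    # references: dict mapping line index -> list of reference captions,
    # in the same order as the hypothesis file
    with open(hyp_file) as f:
        hyps = {i: [line.strip()] for i, line in enumerate(f)}
    refs = {i: references[i] for i in hyps}
    bleu, _ = Bleu(4).compute_score(refs, hyps)
    meteor, _ = Meteor().compute_score(refs, hyps)
    return bleu[3], meteor  # corpus-level BLEU-4 and METEOR

# Inside the training loop, at each validation period (pseudocode):
# 1) save the current parameters to a temporary model file
# 2) run generate_caps.py on the validation split to produce e.g. valid_hyps.txt
# 3) bleu4, meteor = score_hypotheses('valid_hyps.txt', valid_references)
# 4) keep the model with the best BLEU-4 (or METEOR) so far and stop when it stagnates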
Validation:
I normalized the validation loss w.r.t. sequence lengths as well. This seems a better estimate of the validation loss, as the default one is sensitive to the caption lengths in the validation batches.
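Concretely, something like the following (variable names are made up):

import numpy

def normalized_valid_loss(per_caption_nll, caption_lengths):
    # Divide the summed negative log-likelihood by the total number of tokens
    # instead of the number of captions, so batches with long captions do not
    # look artificially worse.
    return float(numpy.sum(per_caption_nll)) / float(numpy.sum(caption_lengths))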
Hyperparameters:
I'm still experimenting but the best working system so far had the following parameters:
n_words: 9584
maxlen: 100
decay_c: 1e-05
alpha_c: 0 (This is 1 in the original code)
use_dropout: False (dropout is enabled by default in the original code)
patience: 10
ctx_dim: 512
dim: 1000 (This is 1800 in the original code)
dim_word: 512
batch_size: 128
optimizer: adam (rmsprop is OK too but adadelta is completely failing)
lstm_encoder: False
n_layers_init: 2
n_layers_att: 2
n_layers_lstm: 1
n_layers_out: 1
ctx2out: True
prev2out: True
selector: True
attn_type: deterministic (didn't try the hard one)
validFreq: 500
Results:
I trained a system yesterday with early stopping on BLEU (but this was using the multi-bleu.perl script, which has different dynamics than the pycocoevalcap utilities). I generated the captions with sampling instead of beam search during the validation periods. At the end I obtained the following results with the best validation model:
(EDIT: Fixed the results of my system, which were for the validation split instead of the test split.)
Description | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR |
---|---|---|---|---|---|
Beam (12) | 57.9 | 39.3 | 26.9 | 18.5 | 17.58 |
Sampling | 61.2 | 41.4 | 28.12 | 19.1 | 16.77 |
Paper results (soft) | 66.7 | 43.4 | 28.8 | 19.1 | 18.49 |
Paper results (hard) | 66.9 | 43.9 | 29.6 | 19.9 | 18.46 |
Problems:
The main problem is the duplicate captions in the final files:
$ sort -u adam-512emb-1000lstm-wdecay-att2-init2-flickr30k-en-bleu.sampling.dev.txt | wc -l
853
$ sort -u adam-512emb-1000lstm-wdecay-att2-init2-flickr30k-en-bleu.beam12.1best.dev.txt | wc -l
790
So out of 1014 validation images, I can only generate 853/790 unique captions. This seems to be an important problem that I'm facing. The richness of the captions is also quite limited: for the sampling case, I have 497 unique words out of a vocabulary of ~10K words; for beam search, the number is 561.
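For reference, a quick Python equivalent of the uniqueness checks above (the hypothesis file name is just a placeholder):

with open('sampling.dev.txt') as f:
    captions = [line.strip() for line in f if line.strip()]

unique_captions = set(captions)
vocab_used = set(w for c in captions for w in c.split())
# total captions, unique captions, unique words actually used
print(len(captions), len(unique_captions), len(vocab_used))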
EDIT: I actually checked the generated captions against the images. Even though there are, for example, 10 instances of "a group of people are standing outside" for 10 different images, it's actually true in terms of scene description: in all of the images there are some people standing outside :) So maybe this is related to the weak diversity of the Flickr30k dataset.
The BLEU results of multi-bleu.perl and pycocoevalcap are very different. I got 65% on BLEU-1 with multi-bleu.perl, but bleu.py in pycocoevalcap showed around 50% on the same samples and GTs.
Hi @ozancaglayan, could you share your code for handling the normalization process, please?
> Validation: I normalized the validation loss w.r.t. sequence lengths as well. This seems a better estimate of the validation loss, as the default one is sensitive to the caption lengths in the validation batches.
Hi @intuinno, would you share the model file trained on COCO? Also, what are your best validation/test costs for Flickr8k and COCO? Thanks.
So does anyone get a better score on COCO? I used @intuinno's code and got scores similar to his (at the top of this issue) in the end (17 epochs). However, when I calculated the scores at epoch 10, they turned out to be better than at epoch 17: BLEU 0.6398/0.4518/0.3127/0.218, METEOR 0.2384.
I got BLEU: 0.6887/0.5034/0.3588/0.2547, METEOR: 0.2234 on COCO with http://cs.stanford.edu/people/karpathy/deepimagesent/; the feature size is 4096, so I used the features by reshaping them to 8x512. However, Flickr8k training failed. I didn't try Flickr30k.
@AAmmy Hi, I tested your code and got 'Bleu_4': 0.276, 'Bleu_3': 0.367, 'Bleu_2': 0.497, 'Bleu_1': 0.668 with beam_size = 10. Was your result based on a beam_size of 1?
@AAmmy @xinghedyc Could you please explain how you used http://cs.stanford.edu/people/karpathy/deepimagesent/? Did you use it for extracting features?
@Lorne0 Hi, what you can download from that website is a COCO dataset, COCO (750MB) http://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip. The vgg_feats.mat contains features extracted with the VGG net, 4096-dimensional for each image, and the JSON file contains all the captions. For more details you can read their paper.
@xinghedyc Thank you. But I still don't understand. The feature is 4096-dimensional, and @AAmmy said to reshape it to 8x512, and then? Which of the 8 new features should I use?
@Lorne0 I think 8×512 means 8 annotation vectors, which the paper defines as a = {a1, ..., aL}, ai ∈ RD; you can refer to Section 3.1.1 in the paper. The original code uses 196 × 512 annotation vectors, so @AAmmy tested 8 annotation vectors in soft-attention mode using the dataset above, and it actually works.
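A small sketch of the reshape being discussed, assuming the vgg_feats.mat layout from the Karpathy files (a 4096 x N 'feats' matrix):

import numpy
from scipy.io import loadmat

feats = loadmat('vgg_feats.mat')['feats']               # (4096, n_images) in the Karpathy files
feats = feats.T.astype('float32')                       # (n_images, 4096)
annotations = feats.reshape([feats.shape[0], 8, 512])   # 8 "annotation vectors" of 512 dims each
print(annotations.shape)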
@xinghedyc Thank you~ I just ran 3 epochs, but when I use metrics.py I always get IOError: [Errno 32] Broken pipe in pycocoevalcap/meteor/meteor.py. Did you have this problem?
@Lorne0 Yes, I also got this problem, so I just commented out that line in metrics.py, like this:
scorers = [
    (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
    #(Rouge(), "ROUGE_L"),
    #(Cider(), "CIDEr")
]
This is because I care more about BLEU, but you could try to fix this problem :)
@xinghedyc I think METEOR is important too. I'll try to fix it, thank you~ @AAmmy, could you help us with this problem?
@xinghedyc I think I found the solution: just delete pycocoevalcap/ and clone the newest one :)
@Lorne0 OK, I'll check it.
@Lorne0 @xinghedyc My result of BLEU: 0.6887/0.5034/0.3588/0.2547, METEOR: 0.2234 is based on beam_size = 1. I checked only epoch 19; maybe some other epoch (from 1 to 18) shows a better score. The references were generated by the code in scripts.py.
@AAmmy thanks, I got BLEU-4 23.9 with a beam size of 1, but 27.6 with a beam size of 10. I only trained 11 epochs; maybe more epochs should be trained.
@xinghedyc @AAmmy Just to confirm, the results you get when increasing the beam size are correct. At the time of publication, we were using a beam size of 1 (mea culpa!!!)
I got BLEU-1 0.685, BLEU-2 0.507, BLEU-3 0.363, BLEU-4 0.258, METEOR 0.234, ROUGE-L 0.505, CIDEr 0.836.
These days I use Capgen and a VQA model on TensorFlow. It's very flexible. I can share the code if needed.
@AAmmy Could you please share your TensorFlow code?
@porcofly https://drive.google.com/open?id=0B9SwS-q4-5HxdUdHdG9yYjZQQjQ model 1, 2 are based on https://github.com/jazzsaxmafia/show_attend_and_tell.tensorflow https://github.com/yunjey/show-attend-and-tell respectively.
Excuse me, how long does this training process roughly take? I have run it for about 12 hours and it is still stuck in epoch 1. I really don't know what's wrong with it. My GPU is a Quadro K4200. Thank you...
> The BLEU results of multi-bleu.perl and pycocoevalcap are very different. I got 65% on BLEU-1 with multi-bleu.perl, but bleu.py in pycocoevalcap showed around 50% on the same samples and GTs.
I am wondering why pycocoevalcap gives me a different BLEU score compared to multi-bleu.perl. I took 2 sentences and calculated the BLEU score manually; the result matches multi-bleu.perl but not pycocoevalcap. What algorithm exactly does pycocoevalcap use?
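For debugging, pycocoevalcap's BLEU can be run directly on a couple of sentences; differences in tokenization and reference-length handling are common reasons it diverges from multi-bleu.perl. A toy check might look like this (the sentences are made up):

from pycocoevalcap.bleu.bleu import Bleu

# references: image_id -> list of reference captions; hypotheses: image_id -> single-item list
gts = {0: ['a dog is running on the grass'],
       1: ['two children are playing soccer in the park']}
res = {0: ['a dog runs on the grass'],
       1: ['children playing soccer']}

score, _ = Bleu(4).compute_score(gts, res)
print(score)  # [BLEU-1, BLEU-2, BLEU-3, BLEU-4] at the corpus level

Feeding exactly the same tokenized sentences to multi-bleu.perl should show where the two scorers diverge.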
@AAmmy Thanks for sharing your TensorFlow-based repo. If you could share the scripts that generate the following files (used in your implementation), it would be very useful: "tokens.npy", "tokens_flat.npy", "filename.npy", "filepath.npy", "vgg_feats.npy", "tokens_flat_to_image_lookup.npy".
@AAmmy Can you post accuracy numbers for the TensorFlow-based implementations below? https://github.com/jazzsaxmafia/show_attend_and_tell.tensorflow https://github.com/yunjey/show-attend-and-tell
Hello, everyone,
I got the following scores after I ran on COCO:
{'CIDEr': 0.50350648251818364, 'Bleu_4': 0.20037826460154334, 'Bleu_3': 0.2920434703847389, 'Bleu_2': 0.42775646056296673, 'Bleu_1': 0.6105274018537202, 'ROUGE_L': 0.43556281782994649, 'METEOR': 0.23890246684760072}
So METEOR is almost the same. However, my BLEU scores are 7~8% lower than in the paper. I wonder if this is acceptable or whether there is something wrong in my process.
Would you please share your results in this post?
Thanks.