kelvinxu / arctic-captions


Post your evaluation score #20

Open intuinno opened 8 years ago

intuinno commented 8 years ago

Hello, everyone,

I got the following scores after running the model on COCO.

{'CIDEr': 0.50350648251818364, 'Bleu_4': 0.20037826460154334, 'Bleu_3': 0.2920434703847389, 'Bleu_2': 0.42775646056296673, 'Bleu_1': 0.6105274018537202, 'ROUGE_L': 0.43556281782994649, 'METEOR': 0.23890246684760072}

So METEOR is almost the same. However, my BLEU scores are 7~8% lower than the paper's. I wonder if this is acceptable or whether there is something wrong in my process.

Would you please share your results in this post?

Thanks.

snakeztc commented 8 years ago

Hi @intuinno, which hyperparameters did you pick? I tried to run it on the COCO data; the algorithm terminated after 15 epochs and achieved the following scores, which are much lower than yours.

Bleu 1: .527 Bleu 2: .333 Bleu 3: .210 Bleu 4: .138 METEOR: .163 ROUGE_L: .403 CIDEr: .371

yaxingwang commented 8 years ago

Hi @intuinno, @snakeztc, I am running this code on Flickr8k, just to test it. I have run it and got the visualizations, but I don't know how to get the scores (BLEU and METEOR). Could you tell me which script computes them? Please forgive me if I am bothering you.

AAmmy commented 8 years ago

Hi @intuinno, @snakeztc, @yaxingwang, my Flickr8k scores with the parameters in capgen.train() are BLEU = 0.504 / 0.270 / 0.145 / 0.082. The best I got is BLEU = 0.550 / 0.296 / 0.164 / 0.095 with the parameters in eval_coco plus optimizer = rmsprop. My scores are lower than the paper's: BLEU = 0.670 / 0.457 / 0.314 / 0.213.

@yaxingwang metrics.py, or I am using a neuraltalk script (see my repository).

yaxingwang commented 8 years ago

Hi @AAmmy, I also get results similar to yours, but the BLEU-1 you get is better than mine (0.30). Did you normalize the dataset? After doing that, I got worse results. metrics.py is what I use to get the scores.

AAmmy commented 8 years ago

@yaxingwang I did not normalize the dataset. My preprocessing:

  1. center crop images
  2. resize images to 224x224
  3. extract features with VGG_ILSVRC_19_layers

token and train, valid, test are the same with the files in Flickr8k_text.zip
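
For step 3, here is a rough sketch of extracting the conv5_4 annotation vectors with pycaffe; the file names and preprocessing constants are assumptions, not the exact script used here:

import numpy as np
import caffe

# Rough sketch only: load the VGG-19 Caffe model and read conv5_4 activations as the
# 196 x 512 annotation vectors the model expects. Paths and constants are assumptions.
net = caffe.Net('VGG_ILSVRC_19_layers_deploy.prototxt',
                'VGG_ILSVRC_19_layers.caffemodel', caffe.TEST)
net.blobs['data'].reshape(1, 3, 224, 224)

def conv5_4_features(img):
    # img: 224x224x3 RGB uint8 array, already center-cropped and resized (steps 1-2)
    x = img[:, :, ::-1].astype(np.float32)            # RGB -> BGR
    x -= np.array([103.939, 116.779, 123.68])         # subtract ImageNet BGR mean
    x = x.transpose(2, 0, 1)[None, :, :, :]           # HWC -> NCHW
    net.blobs['data'].data[...] = x
    net.forward()
    feat = net.blobs['conv5_4'].data[0]               # (512, 14, 14)
    return feat.reshape(512, 14 * 14).T               # (196, 512) annotation vectors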

yaxingwang commented 8 years ago

Yes, we did the same; I tried it because my results are poor.
For me (see the sketch below):

  1. resize keeping the aspect ratio: if width > height, then width = (width * resize) / height and height = resize (resize = 256); otherwise height = (height * resize) / width and width = resize
  2. center crop images to 224x224
  3. extract features from VGG conv5_4

I am confused about whether all parameters are the same for the three datasets. I just ran the code offered by @intuinno, and there the parameters for the three datasets are the same. Besides, at what epoch does the script stop for you? I got epoch = 79, and I don't know whether that is over-fitting.
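
A runnable version of the resize/crop steps just described (a sketch using PIL; resize = 256 and crop = 224 as stated):

from PIL import Image

# Sketch of the preprocessing above: scale the shorter side to `resize`,
# then center-crop to a `crop` x `crop` square.
def preprocess_image(path, resize=256, crop=224):
    im = Image.open(path).convert('RGB')
    width, height = im.size
    if width > height:
        width, height = (width * resize) // height, resize
    else:
        width, height = resize, (height * resize) // width
    im = im.resize((width, height), Image.BILINEAR)
    left, top = (width - crop) // 2, (height - crop) // 2
    return im.crop((left, top, left + crop, top + crop))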

AAmmy commented 8 years ago

@yaxingwang I think the parameters in intuinno's evaluate_flickr8k.py are for COCO and Flickr30k; the parameters for Flickr8k and for Flickr30k/COCO are not the same (Section 5.2 in the paper).

I think the parameters in the original capgen.py file are for Flickr8k (I used these; training stopped around epoch 70).

I also trained on Flickr8k with the parameters for COCO (the same as intuinno's evaluate_flickr8k.py, so the same as yours?); the scores were BLEU = 0.493 / 0.258 / 0.130 / 0.072, with early stopping at epoch 89 (6-12 hours).

I also changed patience and some other parameters to check for over-fitting; after epoch 89, samples from the validation set seemed to be getting better, but the BLEU score (on test) was getting worse.

yaxingwang commented 8 years ago

@AAmmy, thank you. I am trying both Flickr30k and COCO, but I guess my computer's memory is too small to process Flickr30k, so I am still working on it. Did you run into this problem with Flickr30k? It raises a MemoryError.

When using epoch = 10 or 20, the results are worse than those from the epoch where the script early-stopped. Still, I think the epoch chosen by early stopping is not necessarily the best, since it does not have a strong relation with the scores. Maybe testing different epochs is the way to go.

AAmmy commented 8 years ago

@yaxingwang I have the same memory problem on COCO, and the sparse-to-dense conversion is too slow, so I extracted the features into one file per image.

I changed the code and data format as below.

caption example:

train_cap = [['a dog running', 'OOO.jpg'], ['dogs running', 'OOO.jpg'],
             ..., ['a cat running', '+++.jpg'], ['cats running', '+++.jpg']]

('a dog running' and 'dogs running' are captions for OOO.jpg; OOO.jpg is the image file name, and OOO.jpg.mat will hold the feature extracted from OOO.jpg)

In flickr.py or coco.py:

In prepare_data():

# load the target feature file for each caption (needs: from scipy.io import loadmat)
for cc in caps:
    seqs.append([worddict[w] if worddict[w] < n_words else 1 for w in cc[0].split()])
    feat_list.append(loadmat(feat_path + str(cc[1]) + '.mat')['feats']) # my code
    # feat_list.append(features[cc[1]]) # original code
# OOO.jpg.mat already stores a dense matrix, so there is no need to call todense()

# y = numpy.zeros((len(feat_list), feat_list[0].shape[1])).astype('float32') # original code
# for idx, ff in enumerate(feat_list): # original code
    # y[idx,:] = numpy.array(ff.todense()) # original code
# y = y.reshape([y.shape[0], 14*14, 512]) # original code
y = numpy.array(feat_list).reshape([len(feat_list), 14*14, 512]).astype('float32') # my code

In load_data():

# only the caption files are loaded; the big feature matrix is no longer needed here
train_cap = pkl.load(open(path+'flicker_30k_cap.train.pkl', 'rb'))
train_feat = []
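
For completeness, a hypothetical helper (not from the repo) showing how the per-image OOO.jpg.mat files loaded above could be written:

from scipy.io import savemat

# Hypothetical helper: write one dense .mat file per image so that
# loadmat(feat_path + name + '.mat')['feats'] in prepare_data() works as above.
def save_features(features, filenames, feat_path):
    # features: (n_images, 196*512) dense conv5_4 features; filenames: image file names
    for feat, name in zip(features, filenames):
        savemat(feat_path + name + '.mat',
                {'feats': feat.reshape(1, 14 * 14 * 512).astype('float32')})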

Hmm... I will try on different epochs.

yaxingwang commented 8 years ago

@AAmmy , Thanks.

ozancaglayan commented 8 years ago

I also created my own scripts to prepare the data. I completely skipped the sparse matrix stuff since I think it's not needed at all. I have a single hdf5 file with CONV5_4 features from VGG19 network for Flickr30k (around 12GB). This file contains all the image features for all splits in the following order: train, valid and test. The order of the jpeg files for matching the order of the feature matrix is also available.

I am pretty sure that I am not making any mistake (but apparently I am, since you at least have some results), but all I get is repetitive phrases of meaningless words, with a BLEU of 0 and a validation loss which doesn't improve at all.

I create the dictionary in a frequency-ordered fashion, where 0 is <eos> and 1 is UNK.

I don't know where the problem is.
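
For comparison, a small sketch of the dictionary convention described above (frequency-ordered, indices 0 and 1 reserved):

from collections import Counter

# Sketch: build a frequency-ordered word dictionary; 0 and 1 stay reserved
# (end-of-sentence and UNK), real words start at index 2.
def build_worddict(captions):
    counts = Counter(w for c in captions for w in c.split())
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common())}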

volkancirik commented 8 years ago

@intuinno, your results are the closest to the reported COCO results; which hyper-parameters did you use?

@kelvinxu, @kyunghyuncho, the paper does not mention the hyper-parameters for the different datasets. Would you mind providing this information (plus maybe even the models themselves, which would not be too big for a Dropbox/GDrive file)?

ozancaglayan commented 8 years ago

Hi everybody,

I'd like to share my observations and experiments with the code on the Flickr30k dataset:

Preprocessing:

Feature dimensions:

Early stopping with BLEU:

This seems critical and it is mentioned in the paper as well, but unfortunately it is not implemented in the code. The validation loss is not correlated with BLEU or METEOR. I just save the model to a temporary file before each validation and call generate_caps.py to write the hypotheses to a file. I then use the pycocoevalcap utilities to obtain BLEU1-BLEU4 and METEOR scores. After that you can select the metric on which you would like to early stop.
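
A rough sketch of that scoring step (a hypothetical helper; it only relies on the pycocoevalcap Bleu and Meteor scorers that metrics.py also uses):

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor

# Hypothetical helper: score the hypotheses written by generate_caps.py against the
# validation references, so training can early-stop on BLEU/METEOR instead of the loss.
def validation_scores(hyp_file, ref_files):
    hyps = [line.strip() for line in open(hyp_file)]
    refs = [[line.strip() for line in open(f)] for f in ref_files]  # one list per reference file
    res = {i: [h] for i, h in enumerate(hyps)}
    gts = {i: [r[i] for r in refs] for i in range(len(hyps))}       # all references for image i
    bleu, _ = Bleu(4).compute_score(gts, res)                       # [Bleu_1, ..., Bleu_4]
    meteor, _ = Meteor().compute_score(gts, res)
    return {'Bleu_4': bleu[3], 'METEOR': meteor}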

Validation:

I normalized the validation loss w.r.t. sequence length as well. This seems a better estimate of the validation loss, as the default one is sensitive to the caption lengths in the validation batches.
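
A minimal sketch of that normalization, assuming the summed per-caption costs and the 0/1 masks are already available (names are placeholders):

import numpy

# Minimal sketch: divide each caption's summed negative log-likelihood by its length
# before averaging, so long captions do not dominate the validation loss.
def length_normalized_cost(costs, masks):
    costs = numpy.asarray(costs, dtype='float32')                 # summed -log p per caption
    lengths = numpy.asarray(masks).sum(axis=0).astype('float32')  # caption lengths from the masks
    return float((costs / lengths).mean())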

Hyperparameters:

I'm still experimenting but the best working system so far had the following parameters:

n_words: 9584
maxlen: 100
decay_c: 1e-05
alpha_c: 0 (This is 1 in the original code)
use_dropout: False (dropout is enabled by default in the original code)
patience: 10
ctx_dim: 512
dim: 1000 (This is 1800 in the original code)
dim_word: 512
batch_size: 128
optimizer: adam (rmsprop is OK too but adadelta is completely failing)
lstm_encoder: False
n_layers_init: 2
n_layers_att: 2
n_layers_lstm: 1
n_layers_out: 1
ctx2out: True
prev2out: True
selector: True
attn_type: deterministic (didn't try the hard one)
validFreq: 500
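
Assuming these map directly onto keyword arguments of capgen's train() (the exact signature differs between forks, so treat the argument names as an assumption), the configuration would look roughly like this:

import capgen

# Rough sketch of the configuration above passed to train(); argument names are assumed
# to match the fork being used and should be double-checked against its capgen.py.
capgen.train(
    n_words=9584, maxlen=100, decay_c=1e-05, alpha_c=0.,
    use_dropout=False, patience=10,
    ctx_dim=512, dim=1000, dim_word=512, batch_size=128,
    optimizer='adam', lstm_encoder=False,
    n_layers_init=2, n_layers_att=2, n_layers_lstm=1, n_layers_out=1,
    ctx2out=True, prev2out=True, selector=True,
    attn_type='deterministic', validFreq=500)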

Results:

I trained a system yesterday with early-stopping on BLEU (but this was using the multi-bleu.perl script which has different dynamics than the pycocoevalcap utilities). I generated the captions with sampling instead of beam-search during validation periods. At the end I obtained the following results with the best validation model:

(EDIT: fixed the results of my system, which were for the validation split instead of the test split.)

Description            BLEU1   BLEU2   BLEU3   BLEU4   METEOR
Beam (12)              57.9    39.3    26.9    18.5    17.58
Sampling               61.2    41.4    28.12   19.1    16.77
Paper results (soft)   66.7    43.4    28.8    19.1    18.49
Paper results (hard)   66.9    43.9    29.6    19.9    18.46

Problems:

The main problem is the duplicate captions in the final files:

$ sort -u adam-512emb-1000lstm-wdecay-att2-init2-flickr30k-en-bleu.sampling.dev.txt | wc -l
853
$ sort -u adam-512emb-1000lstm-wdecay-att2-init2-flickr30k-en-bleu.beam12.1best.dev.txt | wc -l
790

So out of 1014 validation images, I can only generate 853/790 unique captions. This seems to be an important problem that I'm facing. The richness of the captions is also quite limited. For the sampling case, I have 497 unique words out of a vocabulary of ~10K words. For beamsearch, the number is 561.
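
The same counts can be reproduced in Python (a small sketch; pass a generated-captions file with one caption per line):

# Small sketch: count unique captions and the number of distinct words actually used
# in a generated-captions file (one caption per line).
def diversity_stats(path):
    caps = [line.strip() for line in open(path)]
    vocab = set(w for c in caps for w in c.split())
    return len(set(caps)), len(vocab)

# e.g. diversity_stats('adam-512emb-1000lstm-wdecay-att2-init2-flickr30k-en-bleu.sampling.dev.txt')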

EDIT: I actually checked the generated captions and the images. Even though there are, for example, 10 instances of "a group of people are standing outside" for 10 different images, it is actually true in terms of scene description: in all of those images there are some people standing outside :) So maybe this can be related to the weak diversity of the Flickr30k dataset.

AAmmy commented 8 years ago

The BLEU results of multi-bleu.perl and pycocoevalcap are very different. I got 65% BLEU1 with multi-bleu.perl, but bleu.py in pycocoevalcap showed around 50% on the same samples and ground truths.

AAmmy commented 8 years ago

Hi @ozancaglayan, could you share your code for the normalization step below, please?

Validation:

I normalized the validation loss w.r.t. sequence length as well.
This seems a better estimate of the validation loss, as the default
one is sensitive to the caption lengths in the validation batches.

frajem commented 8 years ago

Hi @intuinno, would you share the model file trained on COCO? Also, what are your best validation/test costs for Flickr8k and COCO? Thanks.

Lorne0 commented 8 years ago

So does anyone get a better score on COCO? I used @intuinno's code and in the end (17 epochs) I got a score similar to his (at the top of this issue). However, when I calculated the score at epoch 10, it turned out to be better than the 17th epoch: BLEU 0.6398/0.4518/0.3127/0.218, METEOR 0.2384.

AAmmy commented 8 years ago

I got BLEU: 0.6887/0.5034/0.3588/0.2547 and METEOR: 0.2234 on COCO with the features from http://cs.stanford.edu/people/karpathy/deepimagesent/. The feature size is 4096, so I used them by reshaping to 8x512. However, Flickr8k training failed with them. I didn't try Flickr30k.

xinghedyc commented 8 years ago

@AAmmy Hi, I tested your code and got 'Bleu_4': 0.276, 'Bleu_3': 0.367, 'Bleu_2': 0.497, 'Bleu_1': 0.668 with beam_size = 10. Was your result based on a beam_size of 1?

Lorne0 commented 8 years ago

@AAmmy @xinghedyc Could you please explain how you used http://cs.stanford.edu/people/karpathy/deepimagesent/ ? Did you use it for extracting features?

xinghedyc commented 8 years ago

@Lorne0 Hi, what you can download from that website is a COCO dataset, COCO (750MB) http://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip. The vgg_feats.mat contains features extracted through the VGG net, 4096 dimensions per image, and the json file contains all the captions. For more details you can read their paper.

Lorne0 commented 8 years ago

@xinghedyc Thank you. But I still don't understand. The feature is 4096-dimensional, and @AAmmy said to reshape it to 8x512, and then? Which 8 of the new features should I use?

xinghedyc commented 8 years ago

@Lorne0 I think 8×512 means 8 annotation vectors, which the paper defines as a = {a_1, ..., a_L}, a_i ∈ R^D; see Section 3.1.1 in the paper. The original code uses 196 × 512 annotation vectors, so @AAmmy tested 8 annotation vectors in soft attention mode using the dataset above, and it actually works.
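
A small sketch of that reshape, assuming Karpathy's vgg_feats.mat stores the 4096-d features as a (4096, n_images) matrix (check the orientation in your copy):

from scipy.io import loadmat

# Sketch: turn one image's 4096-d feature vector into 8 annotation vectors of size 512.
feats = loadmat('vgg_feats.mat')['feats']      # assumed shape (4096, n_images)
annotations = feats[:, 0].reshape(8, 512)      # image 0 as a_1, ..., a_8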

Lorne0 commented 8 years ago

@xinghedyc Thank you~ I just ran 3 epochs, but when I use metrics.py I always get IOError: [Errno 32] Broken pipe in pycocoevalcap/meteor/meteor.py. Did you have this problem?

xinghedyc commented 8 years ago

@Lorne0 Yes, I also got this problem, so I just commented out that line in metrics.py, like this:

scorers = [
    (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
    # (Meteor(), "METEOR"),
    # (Rouge(), "ROUGE_L"),
    # (Cider(), "CIDEr")
]

This is because I care more about BLEU, but you could try to fix the problem :)

Lorne0 commented 8 years ago

@xinghedyc I think METEOR is important too. I'll try to fix it, thank you~ @AAmmy, could you help us with this problem?

Lorne0 commented 8 years ago

@xinghedyc I think I found the solution: just delete pycocoevalcap/ and clone the newest version :)

xinghedyc commented 8 years ago

@Lorne0 OK, I'll check it.

AAmmy commented 8 years ago

@Lorne0 @xinghedyc My result, BLEU: 0.6887/0.5034/0.3588/0.2547, METEOR: 0.2234, is based on beam_size 1. I only checked epoch 19; maybe some other epoch (from 1 to 18) gives a better score. The references were made by the code written in scripts.py.

xinghedyc commented 8 years ago

@AAmmy thanks, I got BLEU-4 23.9 with a beam size of 1, but 27.6 with a beam size of 10. I only trained 11 epochs; maybe more epochs should be trained.

kelvinxu commented 8 years ago

@xinghedyc @AAmmy Just to confirm, the results you get for increasing the beam size are correct. At the time of publication, we were using a beamsize of 1 (mea culpa!!!)

DongNaeSwellfish commented 7 years ago

I got bleu1 0.685 bleu2 0.507 bleu3 0.363 bleu4 0.258 meteor 0.234 rouge_L 0.505 cider 0.836

ammmy commented 7 years ago

These days I use capgen and a VQA model on TensorFlow. It's very flexible. I can share the code if needed.

porcofly commented 7 years ago

@AAmmy Could you please share your code on TensorFlow?

ammmy commented 7 years ago

@porcofly https://drive.google.com/open?id=0B9SwS-q4-5HxdUdHdG9yYjZQQjQ . Models 1 and 2 are based on https://github.com/jazzsaxmafia/show_attend_and_tell.tensorflow and https://github.com/yunjey/show-attend-and-tell respectively.

shaoxuan92 commented 6 years ago

Excuse me, how long does this training process roughly take? I have run it for about 12 hours, and it is still stuck in epoch 1. I really don't know what's wrong. My GPU is a Quadro K4200. Thank you...

ChiZhangRIT commented 6 years ago

The BLEU results of multi-bleu.perl and pycocoevalcap are very different. I got 65% BLEU1 with multi-bleu.perl, but bleu.py in pycocoevalcap showed around 50% on the same samples and ground truths.

I am wondering why pycocoevalcap gives me a different BLEU score compared to multi-bleu.perl. I took 2 sentences and calculated the BLEU score manually; the result matches multi-bleu.perl but not pycocoevalcap. What algorithm exactly does pycocoevalcap use?

kavithasampath commented 6 years ago

@ammmy Thanks for sharing your TensorFlow-based repo. If you could share the script that generates the following files (which you use in your implementation), it would be very useful: "tokens.npy", "tokens_flat.npy", "filename.npy", "filepath.npy", "vgg_feats.npy", "tokens_flat_to_image_lookup.npy"

xxxyyyzzzz commented 6 years ago

@ammmy Can you post accuracy numbers for the TensorFlow-based implementations below? https://github.com/jazzsaxmafia/show_attend_and_tell.tensorflow https://github.com/yunjey/show-attend-and-tell