microsoft / Oscar

Oscar and VinVL

Getting incomplete caption sentences when using a model fine-tuned with CIDEr optimization #80

Open gsrivas4 opened 3 years ago

gsrivas4 commented 3 years ago

I have fine-tuned two BERT-base Oscar checkpoints to generate caption predictions, following the instructions in this section: https://github.com/microsoft/Oscar/blob/master/MODEL_ZOO.md#image-captioning-on-coco. The first model is fine-tuned with cross-entropy loss (step 1 of the instructions), starting from pretrained_models/base-vg-labels/ep_67_588997. The second model is fine-tuned with CIDEr optimization (step 2 of the instructions), starting from the best checkpoint of the cross-entropy fine-tuning.
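For reference, a minimal sketch of the two stages as I understand them from the MODEL_ZOO section above. The flags are abridged and from memory (data/yaml arguments omitted), and the stage-2 checkpoint path is illustrative; verify both against the linked instructions:

```python
# Sketch of the two-stage fine-tuning, driven from Python.
import subprocess

# Stage 1: cross-entropy fine-tuning from the pretrained checkpoint.
subprocess.run([
    "python", "oscar/run_captioning.py",
    "--model_name_or_path", "pretrained_models/base-vg-labels/ep_67_588997",
    "--do_train", "--do_lower_case", "--add_od_labels",
    "--learning_rate", "3e-5",
    "--num_train_epochs", "30",
    "--output_dir", "output/xe",
], check=True)

# Stage 2: CIDEr optimization (SCST), starting from the best stage-1 checkpoint.
subprocess.run([
    "python", "oscar/run_captioning.py",
    "--model_name_or_path", "output/xe/best-checkpoint",  # illustrative path
    "--do_train", "--do_lower_case", "--add_od_labels",
    "--scst",
    "--learning_rate", "5e-6",
    "--num_train_epochs", "5",
    "--output_dir", "output/scst",
], check=True)
```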

For fine-tuning, I am using image features generated by this repository: https://github.com/airsplay/py-bottom-up-attention. So I have not used the author-provided image features.

Even though the model trained with CIDEr optimization achieves a higher CIDEr score on the COCO validation set (1.29 vs. 1.14 for the model fine-tuned with cross-entropy loss), the captions it predicts are incomplete.

Below are captions generated by the two fine-tuned checkpoints for one of the images from the COCO captions validation set (image attachment omitted):

- Fine-tuned with cross-entropy loss: "a truck that is carrying a bunch of bananas in the back."
- Fine-tuned with CIDEr optimization: "a truck with bananas in the back of a"

Most of the captions generated by the model fine-tuned with CIDEr optimization have the same issue: the end of the sentence is missing. Some more examples of captions generated by this model: "a man holding a stop sign on a in a", "a man feeding a giraffe through a fence of a". I have observed the same issue with the CIDEr-optimized checkpoint of the BERT-large model.

If anyone else has seen a similar issue, or knows why it happens, it would be really helpful to hear how you resolved it.

Alcoholrithm commented 3 years ago

@gsrivas4 Hi, I have built a single-image caption inference program using Oscar. I get features, boxes, and pred_classes using https://github.com/airsplay/py-bottom-up-attention, build the new feature vector following https://github.com/microsoft/Oscar/issues/33, and feed them to the pretrained Oscar model. But something is wrong, and I get completely incorrect captions. Can you tell me in detail how you did it?

Sorry for not answering your question.

gsrivas4 commented 3 years ago

@Alcoholrithm I have followed similar steps to the ones you mention, but I fine-tuned the existing checkpoints with the features generated by the airsplay code. First I generated features, boxes, and pred_classes for both the train and val sets using the airsplay repo. Then I fine-tuned the author-provided checkpoint for 30 epochs with the commands in this section: https://github.com/microsoft/Oscar/blob/master/MODEL_ZOO.md#image-captioning-on-coco. The fine-tuned checkpoint generates decent captions for me on new images. Try fine-tuning the checkpoint and it should work.
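For anyone reproducing this, here is a minimal sketch of how the py-bottom-up-attention outputs can be packed into the 2054-dimensional region features Oscar consumes (2048-d appearance feature plus 6-d box geometry, following the recipe in https://github.com/microsoft/Oscar/issues/33). The exact position-encoding layout is my assumption, so check it against your feature files:

```python
# Sketch: pack py-bottom-up-attention outputs into the 2054-d region features
# Oscar expects. The 6-d geometry layout below is an assumption based on
# issue #33; verify it against the author-provided feature files.
import numpy as np

def pack_region_features(features: np.ndarray, boxes: np.ndarray,
                         image_w: float, image_h: float) -> np.ndarray:
    """features: (N, 2048) region features; boxes: (N, 4) as x1, y1, x2, y2."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    # Normalize corner coordinates to [0, 1] and append box width/height fractions.
    pos = np.stack([
        x1 / image_w, y1 / image_h,
        x2 / image_w, y2 / image_h,
        (x2 - x1) / image_w, (y2 - y1) / image_h,
    ], axis=1)
    return np.concatenate([features, pos], axis=1).astype(np.float32)  # (N, 2054)
```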

Alcoholrithm commented 3 years ago

@gsrivas4 Thank you so much!!!

xiaoweihu commented 3 years ago

Hi,

For CIDEr optimization, if there is no EOS token at the end of the training captions, the model tends to generate incomplete sentences. This is fixed in the current version here. The idea is to wrap the sentence so that all the training captions end with the EOS token. With this change, I did not see incomplete sentences in the CIDEr optimization results.
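In spirit, the fix amounts to something like the following minimal sketch (not the repo's actual code; the helper name and max-length handling are illustrative):

```python
# Before computing the CIDEr reward, wrap every training caption so it ends
# with the EOS token; otherwise the reward can favor truncated hypotheses.
def ensure_eos(token_ids: list, eos_id: int, max_len: int) -> list:
    """Truncate to leave room for EOS, then append EOS if it is missing."""
    ids = token_ids[: max_len - 1]
    if not ids or ids[-1] != eos_id:
        ids = ids + [eos_id]
    return ids
```

Once every training caption is guaranteed to end in EOS, the optimization no longer rewards sentences that stop mid-phrase.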

gsrivas4 commented 3 years ago

@xiaoweihu Will try with the current version of the repo. Thank you!