salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

Reproducing BLIP2 COCO ITM Fine-tuning and Adding New Data #275

Open yonatanbitton opened 1 year ago

yonatanbitton commented 1 year ago

Hey BLIP-2 team,

Thanks for your great work! I've been trying to reproduce the BLIP2 COCO ITM fine-tuning using the resources in your repo:

  1. train.py
  2. blip_image_text_matching.ipynb
  3. train_caption_coco.sh
  4. blip_itm_large.yaml

I couldn't find specific instructions or a command to reproduce the COCO ITM fine-tuning. As I understand it, train_caption_coco.sh relates to captioning, and blip_itm_large.yaml is for BLIP-1, not BLIP-2. I also searched the code and previous GitHub issues. Could you share the exact command or script to run this?

Also, I plan to add new fine-tuning data later. Any tips on incorporating new data would be awesome.

Thanks for your help and your amazing work on BLIP-2!

yonatanbitton commented 1 year ago

@LiJunnan1992 pinging to see if you have an idea about this issue 🙏 🙌

LiJunnan1992 commented 1 year ago

You can create a blip2_retrieval model by modifying blip2_qformer to take into account samples["image_id"] when computing ITC and ITM, as done in blip_retrieval.
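For concreteness, a minimal sketch of what that image_id-aware ITC/ITM logic could look like, modeled on how BLIP-1's blip_retrieval does it. The function names and tensor shapes below are assumptions for illustration, not the released BLIP-2 implementation:

```python
import torch
import torch.nn.functional as F

def itc_targets_from_image_ids(image_ids: torch.Tensor) -> torch.Tensor:
    """Soft ITC targets where every pair sharing an image_id counts as a positive,
    mirroring the idx/pos_idx logic in blip_retrieval (assumption: image_ids is a
    1-D LongTensor of length batch_size)."""
    idx = image_ids.view(-1, 1)                        # (B, 1)
    pos_idx = torch.eq(idx, idx.t()).float()           # (B, B), 1 where image_id matches
    return pos_idx / pos_idx.sum(dim=1, keepdim=True)  # normalize each row to sum to 1

def itc_loss(sim_i2t: torch.Tensor, sim_t2i: torch.Tensor, image_ids: torch.Tensor) -> torch.Tensor:
    """Image-text contrastive loss with image_id-aware soft targets instead of a
    plain diagonal target (sim_* are temperature-scaled similarity matrices)."""
    targets = itc_targets_from_image_ids(image_ids)
    loss_i2t = -torch.sum(F.log_softmax(sim_i2t, dim=1) * targets, dim=1).mean()
    loss_t2i = -torch.sum(F.log_softmax(sim_t2i, dim=1) * targets, dim=1).mean()
    return (loss_i2t + loss_t2i) / 2

def itm_negative_weights(sim_i2t: torch.Tensor, image_ids: torch.Tensor) -> torch.Tensor:
    """Hard-negative sampling weights for ITM: zero out any candidate that shares an
    image_id with the anchor so true positives are never drawn as negatives."""
    idx = image_ids.view(-1, 1)
    same_image = torch.eq(idx, idx.t())                # (B, B) bool
    weights = F.softmax(sim_i2t, dim=1) + 1e-4
    return weights.masked_fill(same_image, 0.0)        # sample with torch.multinomial(weights[b], 1)
```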

Then, you can create a yaml file for training on coco retrieval by following the template of this file.
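If it helps, a COCO-retrieval training config in the usual LAVIS project-config shape would look roughly like the sketch below. The `arch`/`model_type` values are assumptions for the new blip2_retrieval model (use whatever names you register), the numeric values are illustrative, and the linked template remains the authoritative reference:

```yaml
model:
  arch: blip2_retrieval      # assumed: the registry name you give the new model
  model_type: coco           # assumed: whichever model_type you register
  load_finetuned: False

datasets:
  coco_retrieval:            # dataset already registered in LAVIS
    vis_processor:
      train:
        name: "blip_image_train"
      eval:
        name: "blip_image_eval"
    text_processor:
      train:
        name: "blip_caption"
      eval:
        name: "blip_caption"

run:
  task: retrieval
  lr_sched: "linear_warmup_cosine_lr"
  init_lr: 1e-5
  min_lr: 0
  max_epoch: 5
  batch_size_train: 16
  batch_size_eval: 64
  num_workers: 4
  seed: 42
  output_dir: "output/BLIP2/Retrieval_COCO"
  evaluate: False
  device: "cuda"
  world_size: 1
  dist_url: "env://"
  distributed: True
```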

For adding a new dataset, you may refer to the LAVIS documentation.
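For the new-data part, the usual LAVIS pattern is to register a dataset builder and point it at your annotation files. A minimal sketch along those lines (the builder name, dataset classes, and config path here are illustrative assumptions based on the existing COCO builders; the docs are the authoritative reference):

```python
# Sketch of registering a custom retrieval dataset with LAVIS.
from lavis.common.registry import registry
from lavis.datasets.builders.base_dataset_builder import BaseDatasetBuilder
from lavis.datasets.datasets.retrieval_datasets import (
    RetrievalDataset,
    RetrievalEvalDataset,
)

@registry.register_builder("my_retrieval")  # name you then reference under `datasets:` in the yaml
class MyRetrievalBuilder(BaseDatasetBuilder):
    train_dataset_cls = RetrievalDataset
    eval_dataset_cls = RetrievalEvalDataset

    # Points the builder at a default dataset config (image roots and annotation
    # json paths); hypothetical path for illustration only.
    DATASET_CONFIG_DICT = {
        "default": "configs/datasets/my_retrieval/defaults.yaml",
    }
```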

shengyi4 commented 1 year ago

> You can create a blip2_retrieval model by modifying blip2_qformer to take into account samples["image_id"] when computing ITC and ITM, as done in blip_retrieval.
>
> Then, you can create a yaml file for training on coco retrieval by following the template of this file.
>
> For adding a new dataset, you may refer to the LAVIS documentation.

Could you please release the code so that we can reproduce the result? I cannot make it work based on this information. Thanks much!

yonatanbitton commented 1 year ago

@LiJunnan1992 sorry for the late response, but I also can't reproduce your results based on this information. Is there any chance you could provide your implementation first, so we can reproduce the ITM results? Later we can work out how to fit new data into it. Supplying that would enable several valuable extensions of the BLIP-2 model 🙏 (also to follow up on this Tweet). Thank you 🙌

LiJunnan1992 commented 1 year ago

@yonatanbitton @shengyi4 You can now finetune for retrieval by running this script: https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip2/train/train_retrieval_coco.sh
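For anyone else landing here, the script is run from the repository root; it is expected to wrap the usual LAVIS launch of train.py with the BLIP-2 retrieval config, so check the script itself for the exact config path and GPU count:

```bash
# From the root of a cloned LAVIS repo
bash run_scripts/blip2/train/train_retrieval_coco.sh
```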

yonatanbitton commented 1 year ago

> @yonatanbitton @shengyi4 You can now finetune for retrieval by running this script: https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip2/train/train_retrieval_coco.sh

Thank you very much, I am checking it out.