salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

Feature extraction for image retrieval and fine tuning #25

Open enrico310786 opened 2 years ago

enrico310786 commented 2 years ago

Hi, congratulations on the results.

My questions are about the correct use of the output features for the retrieval task and about the fine-tuning phase.

  1. In the colab notebook, in the 'Feature extraction' section, the model has three possible outputs: multimodal_feature, image_feature and text_feature. Are they the outputs of, respectively, the image-grounded text encoder, the image encoder and the text encoder (see the snippet right after this list)? So, if I want to check whether two image+text pairs are similar, I have to measure the distance between their multimodal_features, right? If, on the other hand, I just want to check the similarity between images only or between texts only, I should use just the image_feature or the text_feature, right?
  2. To perform the feature extraction, I see that the colab notebook uses model_base.pth. May I use the models already fine-tuned for Image-Text Retrieval (COCO), i.e. BLIP w/ ViT-B or BLIP w/ ViT-L? Are they able to extract the multimodal_feature, image_feature and text_feature in the same way as model_base?
  3. Instead of fine-tuning the Image-Text Retrieval model on the COCO dataset, may I use the same script with another dataset, for example with the image captions in Italian instead of English? If the language changes, is it necessary to change parameters in the BERT language model, or does the model adapt to Italian during fine-tuning?
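
For reference, this is roughly the feature-extraction call I mean (my own sketch of the notebook code; the checkpoint path and image file are placeholders, and I assume the blip_feature_extractor loader from models/blip.py):

import torch
from PIL import Image
from torchvision import transforms
from models.blip import blip_feature_extractor

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# placeholder image path; preprocessing as in the notebook's load_demo_image
preprocess = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
image = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0).to(device)

# placeholder local path to the model_base.pth checkpoint mentioned above
model = blip_feature_extractor(pretrained='model_base.pth', image_size=224, vit='base')
model.eval()
model = model.to(device)

caption = 'a woman sitting on the beach with a dog'
# the three outputs I am asking about
multimodal_feature = model(image, caption, mode='multimodal')[0, 0]   # image-grounded text encoder
image_feature = model(image, caption, mode='image')[0, 0]             # image encoder
text_feature = model(image, caption, mode='text')[0, 0]               # text encoder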

Many thanks, Enrico

LiJunnan1992 commented 2 years ago

Hi, thanks for your questions:

  1. For computing image-text similarity, we provide two ways in the colab notebook: (a) using the multimodal feature + the image-text matching (ITM) head, (b) using the unimodal features and computing their cosine similarity (both are sketched in code right after this list).
  2. Yes, you can use the fine-tuned models.
  3. You can still use our pre-trained BERT, but the tokenizer may not be the most suitable for Italian, and the model is also not pre-trained on non-English languages.
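
In code, (a) and (b) look roughly like the snippet below (a sketch following the colab notebook and models/blip_itm.py; the checkpoint path and image are placeholders, and the retrieval model expects 384x384 inputs):

import torch
from PIL import Image
from torchvision import transforms
from models.blip_itm import blip_itm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# same preprocessing as in the earlier snippet, but at the 384x384 resolution used for retrieval
preprocess = transforms.Compose([
    transforms.Resize((384, 384), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
image = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0).to(device)

# placeholder local path to a base or COCO-finetuned retrieval checkpoint
model = blip_itm(pretrained='model_base_retrieval_coco.pth', image_size=384, vit='base')
model.eval()
model = model.to(device)

caption = 'a woman sitting on the beach with a dog'

# (a) multimodal feature + ITM head: probability that the image and the caption match
itm_output = model(image, caption, match_head='itm')
itm_score = torch.nn.functional.softmax(itm_output, dim=1)[:, 1]

# (b) unimodal features projected and compared with cosine similarity (ITC)
itc_score = model(image, caption, match_head='itc')
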
enrico310786 commented 2 years ago

Hi, thanks for the hint.

If I want to use an Italian BERT with the corresponding Italian tokenizer, in order to pre-train the model on the pre-training datasets (with the sentences translated into Italian), could I use the pretrain script?

python -m torch.distributed.run --nproc_per_node=8 pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain

After that, I could fine-tune the pre-trained model on the COCO dataset, again with the sentences translated into Italian. Would that make sense?
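
Concretely, for the tokenizer I was thinking of something like the sketch below (assumptions on my side: an Italian BERT checkpoint such as dbmdz/bert-base-italian-uncased from the Hugging Face hub, keeping the extra special tokens that BLIP's init_tokenizer in models/blip.py adds, and updating vocab_size in configs/med_config.json to match):

from transformers import BertTokenizer

# assumed Italian checkpoint from the Hugging Face hub
italian_bert = 'dbmdz/bert-base-italian-uncased'

# would replace the 'bert-base-uncased' tokenizer built in models/blip.py (init_tokenizer),
# keeping the extra special tokens the model expects
tokenizer = BertTokenizer.from_pretrained(italian_bert)
tokenizer.add_special_tokens({'bos_token': '[DEC]'})
tokenizer.add_special_tokens({'additional_special_tokens': ['[ENC]']})
tokenizer.enc_token_id = tokenizer.additional_special_tokens_ids[0]

# the text encoder/decoder config would then need a matching vocabulary size
print(len(tokenizer))  # value to use for vocab_size in configs/med_config.json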

LiJunnan1992 commented 2 years ago

Yes you could do that.