salesforce/BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

Inference code for video-text retrieval on MSRVTT! #18

xmu-xiaoma666 opened this issue 2 years ago

xmu-xiaoma666 commented 2 years ago

Thank you for your great open-source code! I am excited about the outstanding zero-shot performance on video-text retrieval. Could you share the inference code for video-text retrieval on MSRVTT? Thanks!

LiJunnan1992 commented 2 years ago

We will release code for video-text tasks soon, thanks.

nikky4D commented 2 years ago

I would like to build my own video-text retrieval demo, but I am not sure how to begin. Can you give me some idea of how to start? Given a text and a video, would just taking an average over the per-frame image embeddings be good enough?

LiJunnan1992 commented 2 years ago

I concatenate the frame embeddings as cross-attention input to the text encoder.

nikky4D commented 2 years ago

Thank you. If I understand correctly, you do the following:

  • Sample the video to get frames
  • Pass the frames through BLIP, individually, to get frame embeddings
  • Concatenate these frame embeddings (question: concatenate into a single sequence, or stack into a block matrix?)
  • Pass the concatenated sequence to the text encoder as blip_model(frame_embedding, text_embedding, ...)

Are these the correct steps?

tongyao-zhu commented 2 years ago

Thanks for the great work! I'm also trying to process multiple images paired with one text. However, I realise that GPU memory is an issue when the concatenated sequence length becomes too long. Do you take all of the patches (length 197) as the frame embedding, or only the [CLS] token's feature?

LiJunnan1992 commented 2 years ago

Thank you. If I understand correctly, you do the following:

  • Sample the video to get frames
  • Pass the frames through BLIP, individually, to get frame embeddings
  • Concatenate these frame embeddings (question: concatenate into a single sequence, or stack into a block matrix?)
  • Pass the concatenated sequence to the text encoder as blip_model(frame_embedding, text_embedding, ...)

Are these the correct steps?

Yes, those are the correct steps @nikky4D: the frame embeddings are concatenated into a single sequence.
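
For illustration, a rough PyTorch sketch of those steps is below. It assumes a BLIP image-text matching model like the one in models/blip_itm.py (with visual_encoder, text_encoder, tokenizer, and itm_head attributes); the frame preprocessing, shapes, and helper signature are assumptions made for the sketch, not the repo's official video-retrieval implementation.

```python
import torch

@torch.no_grad()
def video_text_matching_score(model, frames, caption, device="cuda"):
    """Score one (video, caption) pair by concatenating per-frame embeddings.

    frames:  float tensor [T, 3, H, W] -- T preprocessed frames from one video
    caption: a single caption string
    Illustrative sketch: attribute names follow models/blip_itm.py, but the
    details may differ from the official video-retrieval code in this repo.
    """
    model.eval()
    frames = frames.to(device)

    # 1) Encode each frame independently with the ViT: [T, num_patches + 1, D]
    frame_embeds = model.visual_encoder(frames)

    # 2) Concatenate all frame tokens into one sequence: [1, T*(num_patches+1), D]
    video_embeds = frame_embeds.reshape(1, -1, frame_embeds.shape[-1])
    video_atts = torch.ones(video_embeds.shape[:-1], dtype=torch.long, device=device)

    # 3) Feed the concatenated sequence as cross-attention input to the text encoder
    text = model.tokenizer(caption, padding="longest", truncation=True,
                           max_length=35, return_tensors="pt").to(device)
    output = model.text_encoder(text.input_ids,
                                attention_mask=text.attention_mask,
                                encoder_hidden_states=video_embeds,
                                encoder_attention_mask=video_atts,
                                return_dict=True)

    # 4) Image-text matching head on the [CLS] token -> probability of a match
    itm_logits = model.itm_head(output.last_hidden_state[:, 0, :])
    return torch.softmax(itm_logits, dim=-1)[:, 1]
```

Scoring every (video, text) pair with the ITM head is expensive for full retrieval; one common choice, mirroring the image-text retrieval evaluation in this repo, is to rank candidates first with the contrastive (ITC) similarity of pooled features and rerank only the top-k pairs with the ITM head.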

LiJunnan1992 commented 2 years ago

Thanks for the great work! I'm also trying to process multiple images paired with one text. However, I realise that GPU memory is an issue when the concatenated sequence length becomes too long. Do you take all of the patches (length 197) as the frame embedding, or only the [CLS] token's feature?

Hi @tongyao-zhu, I took all the patches as the frame embedding. The GPU memory issue is more likely due to the increased computation in the visual encoder. Memory usage can be reduced by sparse frame sampling or by enabling gradient checkpointing (changing the config file).
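
To make the first suggestion concrete, a minimal sketch of uniform sparse frame sampling is below; sample_frames_uniform is a hypothetical helper (not part of this repo), and the default frame count is just an assumption to tune against available memory.

```python
import torch

def sample_frames_uniform(video_frames, num_frames=8):
    """Uniformly sample a sparse subset of frames (hypothetical helper).

    video_frames: tensor [N, 3, H, W] holding all decoded frames of a video.
    Returns [num_frames, 3, H, W]. With ViT-B/16 at 224x224, each frame
    contributes 197 tokens, so fewer frames directly shortens the
    concatenated cross-attention sequence and cuts visual-encoder compute.
    """
    total = video_frames.shape[0]
    indices = torch.linspace(0, total - 1, steps=num_frames).long()
    return video_frames[indices]
```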

LiJunnan1992 commented 2 years ago

Hi all, the zero-shot video-text retrieval code has been added. Please check the updated README for instructions. Thanks!