salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Replacing ViT encoder #56

Closed · helleuch closed this issue 2 years ago

helleuch commented 2 years ago

Hello,
I would like to replace the ViT encoder with another transformer architecture. In the state dicts of the BLIP checkpoints there are only keys for "visual_encoder" or "text_decoder"; I can find neither the Text Encoder nor the Image-Grounded Text Encoder. My question is: are those two encoders' weights included in the vision encoder? I have included an image to help you understand what I'm trying to achieve. Thanks :)

[screenshot attached: Screenshot from 2022-05-11 10-11-57]

P.S.: I want to do image/video captioning, and I'm using the BLIP_decoder implementation since it seemed closest to my goal.
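
For reference, this is roughly how the checkpoint contents can be inspected (the checkpoint path below is a placeholder); only "visual_encoder" and "text_decoder" prefixes show up for the captioning model:

```python
import torch

# Load a BLIP captioning checkpoint (the path is a placeholder) and list the
# top-level module prefixes found in its state dict.
checkpoint = torch.load('path/to/blip_caption_checkpoint.pth', map_location='cpu')
state_dict = checkpoint['model'] if 'model' in checkpoint else checkpoint

prefixes = sorted({key.split('.')[0] for key in state_dict})
print(prefixes)  # e.g. ['text_decoder', 'visual_encoder']
```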

LiJunnan1992 commented 2 years ago

The BLIP captioning model does not have a text encoder. You can simply replace the visual encoder with your own model; just make sure its output dimension matches the cross-attention dimension.
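
A minimal sketch of that swap, assuming a hypothetical `MyEncoder` that returns a sequence of features of shape `(batch, num_tokens, feat_dim)`; the 768 below is the cross-attention width of the base configuration (ViT-B's embedding size), and the linear projection is only needed if your encoder's feature dimension differs:

```python
import torch.nn as nn
from models.blip import blip_decoder  # from this repo

# Hypothetical replacement encoder: any module that maps an image (or video
# frames) to a sequence of features of shape (batch, num_tokens, feat_dim).
class MyEncoder(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.feat_dim = feat_dim
        # ... your transformer backbone goes here ...

    def forward(self, x):
        # should return a tensor of shape (batch, num_tokens, self.feat_dim)
        raise NotImplementedError

model = blip_decoder(pretrained='path/to/model_base_caption.pth',  # placeholder path
                     image_size=384, vit='base')

cross_attn_dim = 768  # width the text decoder's cross-attention expects (base config)
encoder = MyEncoder(feat_dim=512)

# The decoder calls self.visual_encoder(image) and cross-attends over its output,
# so projecting to cross_attn_dim makes the dimensions line up.
model.visual_encoder = nn.Sequential(
    encoder,
    nn.Linear(encoder.feat_dim, cross_attn_dim),
)
```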

helleuch commented 2 years ago

Thank you very much for your help