salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

Pretrained Network #83

Open 12sf12 opened 2 years ago

12sf12 commented 2 years ago

Hi

Thanks for your outstanding work.

I ran into an issue when loading the pretrained ViT-Base checkpoint from this URL: 'https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth'

The state dict in that checkpoint does not contain 'visual_encoder.pos_embed', so loading fails. For instance, the following code raises an error:

```python
from models.blip import blip_decoder

model_url = 'https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth'
model = blip_decoder(pretrained=model_url, image_size=224, vit='base')
```
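For reference, the key mismatch can be seen by inspecting the checkpoint directly. This is a minimal diagnostic sketch, assuming the DeiT file follows the usual `{'model': state_dict}` layout: the keys are plain ViT names without the 'visual_encoder.' prefix that BLIP's loader looks up.

```python
import torch

# Download the DeiT checkpoint and look at its keys (assumes the usual
# {'model': state_dict} layout used by the DeiT release).
url = 'https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth'
ckpt = torch.hub.load_state_dict_from_url(url, map_location='cpu', check_hash=False)
state_dict = ckpt['model']

print('pos_embed' in state_dict)                 # True  -> plain ViT key
print('visual_encoder.pos_embed' in state_dict)  # False -> key BLIP expects
```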

Would it be possible to share the recent lightweight pretrained model? The issue occurs only with the checkpoint mentioned above.

Many Thanks.

LiJunnan1992 commented 2 years ago

Hi, my implementation of ViT is based on the timm codebase. You might want to try the pretrained weights from timm.
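In case it helps, one possible workaround along those lines is to build the model without a pretrained URL and copy timm's ViT-Base/16 weights into the visual encoder. This is only a sketch under the assumption that timm's `vit_base_patch16_224` key names line up with BLIP's VisionTransformer (they should for the backbone, since BLIP's ViT follows the timm implementation); it is not the official loading path.

```python
import timm
from models.blip import blip_decoder

# Build BLIP without loading a checkpoint, then initialize the visual encoder
# from timm's pretrained ViT-Base/16. strict=False skips keys that do not
# match, e.g. the classification head, which BLIP's encoder does not have.
model = blip_decoder(pretrained='', image_size=224, vit='base')
vit = timm.create_model('vit_base_patch16_224', pretrained=True)
missing, unexpected = model.visual_encoder.load_state_dict(vit.state_dict(), strict=False)
print('missing keys:', len(missing), 'unexpected keys:', len(unexpected))
```

The text decoder would still start from its default initialization with this approach, so it only replaces the vision-side pretraining, not the full BLIP checkpoint.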