Does a BLIP w/ ViT-L and CapFilt-L model for image captioning exist?

salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

BSD 3-Clause "New" or "Revised" License

4.86k stars, 648 forks

Does a BLIP w/ ViT-L and CapFilt-L model for image captioning exist? #123

Open 4thfever opened 1 year ago

4thfever commented 1 year ago

Hi,

First, I would like to thank you for your great work, which has inspired me a lot.

I would like to know whether a BLIP w/ ViT-L + CapFilt-L model exists, that is, one that uses ViT-Large as the encoder and CapFilt for data augmentation. I believe it should be stronger than both BLIP w/ ViT-B + CapFilt-L and BLIP w/ ViT-L.

Thanks

LiJunnan1992 commented 1 year ago

Thanks for your question. BLIP w/ ViT-L already uses the CapFilt-L model.

4thfever commented 1 year ago

Thanks!