salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

How to use CLIP VIT-L/14 as the visual encoder of BLIP2? #375

Closed · wooozihui closed this issue 1 year ago

wooozihui commented 1 year ago

The default visual encoder appears to be 'eva_clip_g'. I am curious how to switch it to CLIP ViT-L/14. Also, does the CLIP-L/14 encoder use the same Q-Former weights as the EVA-CLIP encoder?

wooozihui commented 1 year ago

You should be able to load the BLIP-2 clip_L checkpoint with model = load_model("blip2", "pretrain_vitL"). The vision embedding size is adjusted dynamically based on the vision encoder used.

https://github.com/salesforce/LAVIS/blob/44151378ae761077c138ec6b6b1b1e418996325c/lavis/models/blip2_models/blip2_qformer.py#L71

I found this answer, which appears to address how to use CLIP-L/14. However, I noticed that the downloaded text model seems significantly smaller than OPT-2.7B. I'm curious what this text model is.
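
For reference, here is a minimal sketch of the quoted suggestion. It assumes lavis.models.load_model accepts (name, model_type, is_eval, device) and that "pretrain_vitL" is a registered model_type for "blip2"; adjust to your installed LAVIS version.

    # Minimal sketch: load the ViT-L variant of the BLIP-2 pretraining checkpoint.
    # Assumes lavis.models.load_model(name, model_type, is_eval=..., device=...)
    # and that "pretrain_vitL" is a registered model_type for "blip2".
    import torch
    from lavis.models import load_model

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = load_model("blip2", "pretrain_vitL", is_eval=True, device=device)

    # The vision width follows the chosen encoder (1024 for CLIP ViT-L/14 vs.
    # 1408 for eva_clip_g); this is what "dynamically adjusted" refers to, since
    # the Q-Former is built from visual_encoder.num_features (see the line
    # linked above in blip2_qformer.py).
    print(model.visual_encoder.num_features)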

YongLD commented 9 months ago

Hi, I don't understand why we should load the whole blip2_pretrained_vitL model. If I just want to use the clip_L visual encoder in InstructBLIP, how can I use CLIP-L/14? I see that the ViT is loaded as follows:

        elif model_name == "clip_L":
            visual_encoder = create_clip_vit_L(img_size, use_grad_checkpoint, precision)
        ln_vision = LayerNorm(visual_encoder.num_features)
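
If the goal is just the clip_L encoder on its own (e.g. to plug into InstructBLIP), something like the sketch below might work. It mirrors the snippet above; the import paths for create_clip_vit_L and the LayerNorm wrapper are my guess at where LAVIS keeps them, so verify against your checkout.

    # Rough sketch: build only the CLIP ViT-L/14 visual encoder plus the
    # layer norm that the snippet above wraps around it. Import paths assumed.
    from lavis.models.clip_vit import create_clip_vit_L
    from lavis.models.blip2_models.blip2 import LayerNorm

    img_size = 224              # input resolution for CLIP ViT-L/14
    use_grad_checkpoint = False
    precision = "fp16"

    visual_encoder = create_clip_vit_L(img_size, use_grad_checkpoint, precision)
    ln_vision = LayerNorm(visual_encoder.num_features)

    # Note: visual_encoder.num_features is 1024 for CLIP ViT-L/14 but 1408 for
    # eva_clip_g, so Q-Former weights pretrained against eva_clip_g will not
    # load cleanly here; the vitL-pretrained checkpoint is the matching one.
    print(visual_encoder.num_features)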
zhw0516 commented 4 months ago

Have you solved the problem?