You should be able to load the BLIP-2 clip_L checkpoint with
model = load_model("blip2", "pretrain_vitL")
The vision embedding size is dynamically adjusted based on the vision encoder used.
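For reference, a minimal sketch of that call (assuming the lavis package is installed; load_model is imported from lavis.models):

```python
import torch
from lavis.models import load_model

device = "cuda" if torch.cuda.is_available() else "cpu"

# Loads the BLIP-2 checkpoint pretrained with the CLIP ViT-L/14 vision encoder.
model = load_model("blip2", "pretrain_vitL", is_eval=True, device=device)

# The vision feature width follows the chosen encoder
# (1024 for ViT-L/14 rather than 1408 for eva_clip_g).
print(model.visual_encoder.num_features)
```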
I've found this answer, which appears to address how to use CLIP-L/14. However, I noticed that the downloaded text model seems significantly smaller than the OPT-2.7B one. I'm curious what this text model is.
Hi, I don't understand why we should load the whole blip2_pretrained_vitL model. If I just want to use the clip_L visual encoder in InstructBLIP, how can I apply CLIP-L/14? I see that the ViT is loaded as follows:
elif model_name == "clip_L":
    visual_encoder = create_clip_vit_L(img_size, use_grad_checkpoint, precision)
    ln_vision = LayerNorm(visual_encoder.num_features)
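If the goal is only to reach that branch, passing vit_model="clip_L" to the model constructor should route through it. A sketch under that assumption (the argument names follow the Blip2Qformer constructor in LAVIS; whether the InstructBLIP classes expose the same vit_model argument is an assumption here):

```python
from lavis.models.blip2_models.blip2_qformer import Blip2Qformer

# Sketch: vit_model="clip_L" hits the elif branch quoted above and calls
# create_clip_vit_L. Note the Q-Former is randomly initialized unless a
# matching checkpoint is loaded afterwards.
model = Blip2Qformer(
    vit_model="clip_L",
    img_size=224,
    vit_precision="fp16",
    freeze_vit=True,
    num_query_token=32,
)
```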
Have you solved the problem?
The default visual encoder appears to be 'eva_clip_g'. I am curious how to switch it to CLIP-L/14. Also, does the CLIP-L/14 encoder use the same Q-Former weights as the EVA-CLIP encoder?
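For reference, one way to probe whether the Q-Former weights could be shared across encoders is to compare the vision feature width with the width the Q-Former cross-attends to. A sketch, with attribute names assumed from the LAVIS init_vision_encoder/init_Qformer code rather than confirmed for every model class:

```python
from lavis.models import load_model

model = load_model("blip2", "pretrain_vitL")

# Width of the features produced by the vision encoder (1024 for CLIP-L/14,
# 1408 for eva_clip_g), and the width the Q-Former's cross-attention expects.
print(model.visual_encoder.num_features)
print(model.Qformer.config.encoder_width)
```

If those two widths differ between checkpoints, cross-attention weights trained against one encoder would not line up with the other.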