Incorporate other embedding models such as DINOv2?

rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them

https://rom1504.github.io/clip-retrieval/

MIT License


Incorporate other embedding models such as DINOv2? #388

Open YuanyuanLi96 opened 1 month ago

YuanyuanLi96 commented 1 month ago

I enjoy using this library very much. However, I noticed that other embedding models, such as DINOv2, could also be used to build the search index and might give higher retrieval accuracy. Is there an easy way to load the 'facebook/dinov2-base' model from Hugging Face and still use clip_inference?

ytzeng1 commented 3 weeks ago

One quick-and-dirty approach is to load the DINOv2 state dict into the visual encoder of a CLIP model; see the discussion in this thread if you are using open_clip. If you want to keep the text search functionality, you will probably need to retrain the text encoder LiT-style to align text and images in the latent space.
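
A rough sketch of the first step of that approach, assuming the two models do not share all parameter names and shapes: before copying weights, check which entries of the source state dict have a same-named, same-shaped counterpart in the target. The helper `split_compatible` and the parameter names below are hypothetical, and the stand-in objects only mimic the `.shape` attribute of real tensors; they are not actual DINOv2 or open_clip keys.

```python
from types import SimpleNamespace


def split_compatible(source_sd, target_sd):
    """Split source entries into those the target can accept and the rest.

    Both arguments map parameter names to objects with a `.shape`
    attribute (for example, the result of `model.state_dict()` in PyTorch).
    """
    compatible, skipped = {}, []
    for name, tensor in source_sd.items():
        target = target_sd.get(name)
        # Keep an entry only if the target has the same name and shape.
        if target is not None and tuple(target.shape) == tuple(tensor.shape):
            compatible[name] = tensor
        else:
            skipped.append(name)
    return compatible, skipped


def fake(*shape):
    """Hypothetical stand-in for a tensor: only carries a shape."""
    return SimpleNamespace(shape=shape)


# Hypothetical parameter names, for illustration only.
source = {
    "layer0.weight": fake(4, 4),
    "layer0.bias": fake(4),
    "mask_token": fake(1, 4),
}
target = {
    "layer0.weight": fake(4, 4),
    "layer0.bias": fake(8),  # same name, different shape
    "proj": fake(4, 2),      # exists only in the target
}

compatible, skipped = split_compatible(source, target)
print("loadable:", sorted(compatible))
print("skipped:", skipped)
```

With real PyTorch modules, the `compatible` dict could then be passed to `load_state_dict(compatible, strict=False)` on the visual encoder. Filtering by shape first matters because a shape mismatch raises an error even when `strict=False`. In practice the two models are likely to name their parameters differently, so a name-mapping step would be needed before this check; that mapping is not shown here.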