openai / CLIP

CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
MIT License
26.1k stars, 3.33k forks

How to convert CLIP vectors to LLM text token embeddings? #460

Open xoraizw opened 2 months ago

xoraizw commented 2 months ago

I'm looking to feed multiple modalities into a conventional text-based LLM. To do that, I first convert each modality into a CLIP vector, which I have already done. Now I need to convert that vector into LLM text token embeddings. Can anyone help me with this conversion?
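To make the question concrete, here is a minimal sketch of the kind of conversion I have in mind: a learned linear projection that maps one CLIP vector to a few "soft token" embeddings of the LLM's hidden size. Everything here is an assumption for illustration: the function names are hypothetical, the sizes (512 for a CLIP ViT-B/32 embedding, 768 for a GPT-2-small-sized LLM, 2 soft tokens) are just examples, and the weights are random rather than trained. It uses plain Python lists so it runs without dependencies; in practice this would presumably be a trained layer such as `torch.nn.Linear`.

```python
import random


def make_projection(clip_dim, llm_dim, num_tokens, seed=0):
    """Create random (untrained) weights for a linear map clip_dim -> num_tokens * llm_dim."""
    rng = random.Random(seed)
    scale = clip_dim ** -0.5  # keeps output magnitudes roughly comparable to the input
    rows = num_tokens * llm_dim
    weight = [[rng.gauss(0.0, scale) for _ in range(clip_dim)] for _ in range(rows)]
    bias = [0.0] * rows
    return weight, bias


def project_to_soft_tokens(clip_vec, weight, bias, llm_dim):
    """Apply y = W x + b, then split y into rows of length llm_dim (one row per soft token)."""
    flat = [sum(w * x for w, x in zip(row, clip_vec)) + b for row, b in zip(weight, bias)]
    return [flat[i:i + llm_dim] for i in range(0, len(flat), llm_dim)]


# Example sizes only (assumptions, see above).
CLIP_DIM, LLM_DIM, NUM_TOKENS = 512, 768, 2

# Stand-in for a real CLIP embedding of an image, audio clip, etc.
rng = random.Random(1)
clip_vec = [rng.gauss(0.0, 1.0) for _ in range(CLIP_DIM)]

weight, bias = make_projection(CLIP_DIM, LLM_DIM, NUM_TOKENS)
soft_tokens = project_to_soft_tokens(clip_vec, weight, bias, LLM_DIM)

print(len(soft_tokens), len(soft_tokens[0]))
```

The resulting rows would then be concatenated with the LLM's ordinary text token embeddings before the first transformer layer. With random weights the output carries no meaning for the LLM, so the projection would have to be trained on paired data (for example, image and caption pairs), and that training step is the part I am unsure about.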