mlfoundations / open_clip

An open source implementation of CLIP.

Make HF clip support models using an HF text encoder #250

Open rom1504 opened 1 year ago

rom1504 commented 1 year ago

Eg https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/tree/main

This needs more config, adapting the weights, and also changing the model at https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/modeling_clip.py#L938 (or creating a new one) to support these CLIP variants.
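For reference, a minimal sketch of how one might inspect the OpenCLIP checkpoint to see which weights would need remapping; the model name and pretrained tag below are assumptions based on the linked checkpoint:

```python
# Sketch: inspect the OpenCLIP state dict to see which towers need remapping.
# Model name / pretrained tag are assumptions based on the linked checkpoint.
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "xlm-roberta-large-ViT-H-14",
    pretrained="frozen_laion5b_s13b_b90k",
)

# Group state-dict keys by top-level prefix: the visual tower, the HF text
# transformer, and any extra projection/pooling layers show up separately.
prefixes = sorted({k.split(".")[0] for k in model.state_dict().keys()})
print(prefixes)
```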

thedarkzeno commented 1 year ago

Hey @rom1504, any progress on porting it to Hugging Face Transformers?

rom1504 commented 1 year ago

No, I'm not actively working on this. Feel free to contribute!

calpt commented 1 year ago

Hey, is there any update on this? It would be great to be able to use the multilingual CLIP models with HF Transformers, similar to the earlier monolingual ones. Is there any procedure available to convert the weights & config from OpenCLIP to Transformers? I'd also be open to helping if given some guidance.

rom1504 commented 1 year ago

Fyi @rwightman

betterze commented 1 year ago

any update?

rwightman commented 1 year ago

I have not spent time on this. It looks like https://huggingface.co/docs/transformers/model_doc/vision-text-dual-encoder might be the approach. We would need to remap the checkpoints from OpenCLIP (I have something hacked together for the standard ViT models that could be improved: https://gist.github.com/rwightman/daa232b313d82b881d7f86b00dff97dd). Since the HF text model is the same, we just need to change the prefix and then remap any extra projection layers.
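To make the prefix idea concrete, here is a rough sketch (not a tested converter); the exact prefixes (`text.transformer.` on the OpenCLIP side, `text_model.` on the VisionTextDualEncoder side) are assumptions, and the vision tower still needs the per-layer remapping from the gist above:

```python
import torch

# Assumed prefix correspondence; verify against the actual checkpoints.
PREFIX_MAP = {
    "text.transformer.": "text_model.",  # HF text encoder body: rename only
    "visual.": "vision_model.",          # vision tower: still needs per-layer remapping
}

def remap_prefixes(state_dict):
    remapped = {}
    for key, value in state_dict.items():
        for src, dst in PREFIX_MAP.items():
            if key.startswith(src):
                remapped[dst + key[len(src):]] = value
                break
        else:
            # e.g. logit_scale and extra projection layers, handled separately
            remapped[key] = value
    return remapped

# Assumes the downloaded OpenCLIP checkpoint file is a plain state dict.
open_clip_sd = torch.load("open_clip_pytorch_model.bin", map_location="cpu")
torch.save(remap_prefixes(open_clip_sd), "dual_encoder_state_dict.bin")
```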

calpt commented 1 year ago

Hey, I just wanted to share the solution I finally used myself:

Script: https://gist.github.com/calpt/8e3555bd11f1916b5169c8125117e5ee

Explanation: Similar to @rwightman's suggestion, I wrote a script to map the checkpoint weights from OpenCLIP to HF Transformers' VisionTextDualEncoder. This worked without issues for the ViT model and most of the XLM-Roberta model. However, as the extra pooling & projection layers on top of XLM-Roberta are not supported out-of-the-box by HF Transformers, I created a custom model class (deriving from VisionTextDualEncoder) to correctly map those weights as well.
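For anyone who just wants the shape of that approach, a minimal sketch of such a subclass (this is not the script above; the head layout and attribute names are assumptions about OpenCLIP's mean-pool + MLP projection text head):

```python
import torch.nn as nn
from transformers import VisionTextDualEncoderModel

class OpenClipStyleDualEncoder(VisionTextDualEncoderModel):
    """VisionTextDualEncoder with an extra pooling + projection text head."""

    def __init__(self, config, vision_model=None, text_model=None):
        super().__init__(config, vision_model=vision_model, text_model=text_model)
        hidden = self.text_model.config.hidden_size
        # Assumed head: a 2-layer MLP projection applied after mean pooling,
        # roughly mirroring OpenCLIP's HFTextEncoder; verify against the checkpoint.
        self.text_projection_head = nn.Sequential(
            nn.Linear(hidden, hidden, bias=False),
            nn.GELU(),
            nn.Linear(hidden, config.projection_dim, bias=False),
        )

    def get_text_features(self, input_ids=None, attention_mask=None, **kwargs):
        outputs = self.text_model(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state
        # Mask-aware mean pooling over tokens.
        mask = attention_mask.unsqueeze(-1).type_as(hidden_states)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
        return self.text_projection_head(pooled)
```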

versae commented 1 year ago

Are there plans for an official conversion script that also supports RoBERTa models as the text encoder?