rom1504 opened this issue 1 year ago
Hey @rom1504, any progress porting it to huggingface transformers?
No, I'm not working actively on this. Feel free to contribute!
On Sun, Dec 4, 2022, 00:01 Adalberto wrote:
> Hey @rom1504, any progress porting it to huggingface transformers?
Hey, is there any update on this? Would be great to be able to use the multilingual CLIP models with HF Transformers similar to the earlier monolingual ones. Is there any procedure available to convert the weights & config from OpenCLIP to Transformers? Would also be open to help if given some guidance.
Fyi @rwightman
any update?
I have not spent time on this; it looks like https://huggingface.co/docs/transformers/model_doc/vision-text-dual-encoder might be the approach. The checkpoints would need to be remapped from OpenCLIP (I have something hacked together for the standard ViT models that could be improved: https://gist.github.com/rwightman/daa232b313d82b881d7f86b00dff97dd). Since the HF text model is the same, you'd just need to change the prefix and then remap any extra projection layers.
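The prefix-change idea above can be sketched as a plain state-dict key rewrite. The prefixes used here (`visual.` / `text.` on the OpenCLIP side, `vision_model.` / `text_model.` on the HF side) are illustrative assumptions, not the exact names either library uses; the real mapping in the gist linked above is more involved.

```python
# Hedged sketch: rename checkpoint keys from OpenCLIP-style prefixes to
# HF-style prefixes. Tensors are stood in for by placeholder values, since
# only the key names matter for the remapping step.

def remap_state_dict(state_dict, prefix_map):
    """Rename keys whose prefix appears in prefix_map; keep other keys as-is."""
    remapped = {}
    for key, value in state_dict.items():
        new_key = key
        for old_prefix, new_prefix in prefix_map.items():
            if key.startswith(old_prefix):
                new_key = new_prefix + key[len(old_prefix):]
                break
        remapped[new_key] = value
    return remapped

# Illustrative prefix mapping (assumed names, not verified against either repo).
prefix_map = {
    "visual.": "vision_model.",
    "text.": "text_model.",
}

openclip_sd = {
    "visual.conv1.weight": "tensor_a",
    "text.transformer.embeddings.word_embeddings.weight": "tensor_b",
    "logit_scale": "tensor_c",  # no prefix match, passes through unchanged
}
hf_sd = remap_state_dict(openclip_sd, prefix_map)
```

Keys with no matching prefix (like `logit_scale`) pass through unchanged, which is usually what you want for shared scalars.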
Hey, I just wanted to share the solution I finally used myself:
Script: https://gist.github.com/calpt/8e3555bd11f1916b5169c8125117e5ee
Explanation: Similar to @rwightman's suggestion, I wrote a script to map the checkpoint weights from OpenCLIP to HF Transformers' VisionTextDualEncoder. This worked without issues for the ViT model and most of the XLM-Roberta model. However, as the extra pooling & projection layers on top of XLM-Roberta are not supported out-of-the-box by HF Transformers, I created a custom model class (deriving from VisionTextDualEncoder) to correctly map those weights as well.
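The workaround described above can be sketched as a split of the checkpoint into two parts: keys the stock VisionTextDualEncoder can load, and the extra pooling/projection weights that need a custom model class. The prefix names below are hypothetical; the actual names are in the linked gist.

```python
# Hedged sketch: separate the weights a stock HF VisionTextDualEncoder has
# slots for from the extra pooling/projection head weights on top of
# XLM-Roberta. The prefixes are illustrative assumptions.

EXTRA_PREFIXES = ("text.pooler.", "text.proj.")  # hypothetical OpenCLIP names

def split_checkpoint(state_dict):
    """Return (standard, extra): extra holds weights needing a custom class."""
    standard, extra = {}, {}
    for key, value in state_dict.items():
        if key.startswith(EXTRA_PREFIXES):
            extra[key] = value
        else:
            standard[key] = value
    return standard, extra
```

The `extra` dict would then be mapped onto the additional layers of a custom class deriving from VisionTextDualEncoder, while `standard` goes through the ordinary prefix remapping.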
Are there plans for an official conversion script that also supports RoBERTa models as the text encoder?
E.g. https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/tree/main
This would need more config, adapting the weights, and also changing the model at https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/modeling_clip.py#L938 (or creating a new one) to support these CLIP variants.