ml-explore / mlx-examples

Examples in the MLX framework
MIT License

Contributing SigLIP to `mlx-examples` #747

Open suvadityamuk opened 2 months ago

suvadityamuk commented 2 months ago

Hello!

I'm new to the MLX ecosystem, and I noticed that there is a working CLIP implementation available in the repository. Since SigLIP has the same structure as CLIP (per the HF implementation), I was wondering if there would be any interest in converting the SigLIP weights to an MLX-compatible format?

Many community modelling efforts to create MLLMs have consistently chosen SigLIP over the original CLIP model due to better performance.

Happy to discuss and understand if there are any nuances I may have overlooked, or should keep in mind. Thank you so much! (and thank you so much for MLX, such a life-saver for folks doing ML dev on Apple Silicon!)

Thoughts? @awni

awni commented 2 months ago

Definitely! If they run out of the box with the MLX Clip example from Hugging Face that would be awesome.

You can upload them to the HF MLX Community (https://huggingface.co/mlx-community/) or to your own HF space. Either is ok!

x4080 commented 1 month ago

@suvadityamuk how do you convert SigLIP to MLX? I tried `python convert.py --hf-repo google/siglip-base-patch16-224` without success; it fails with an error along the lines of `merges.txt` not found.

sujantkumarkv commented 3 weeks ago

I was thinking along the exact same lines as @suvadityamuk. What's the update here, or shall I work on it?

awni commented 3 weeks ago

No update that I know of. I think if it's a small modification of the Clip example we could include it there?

suvadityamuk commented 3 weeks ago

@awni @sujantkumarkv I've been trying to use it out of the box but was facing the issues mentioned above. I've been caught up for a bit but have kept experimenting, with no luck so far. It seems some modifications to the existing CLIP example will be needed to get it to work.

sujantkumarkv commented 3 weeks ago

I'm asking to check whether I'm headed in the right direction here. Feel free to correct me; I'm not an expert.

> No update that I know of. I think if it's a small modification of the Clip example we could include it there?

Based on my reading, the SigLIP paper is only a small tweak: it uses a sigmoid loss in place of CLIP's softmax loss, which depends on a global normalization across the batch. This lets it handle much larger batch sizes while maintaining performance on smaller batches during pretraining. So the SigLIP example might only need changes to the loss function, probably here, and maybe some other slight changes (I need to read the paper in depth).
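For reference, the sigmoid loss described above can be sketched in a few lines of NumPy. This is an illustrative reimplementation of the objective, not code from the CLIP example; `t` and `b` stand in for the paper's learnable temperature and bias, fixed here to plausible initialization values:

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss (sketch of the SigLIP objective).

    img_emb, txt_emb: (n, d) L2-normalized embeddings for n matched pairs.
    t, b: temperature and bias (learnable in the paper; constants here).
    """
    n = img_emb.shape[0]
    logits = t * img_emb @ txt_emb.T + b   # (n, n) pairwise similarities
    labels = 2.0 * np.eye(n) - 1.0         # +1 for matched pairs, -1 otherwise
    # -log sigmoid(label * logit) == log(1 + exp(-label * logit)),
    # computed stably with logaddexp. Each pair contributes independently,
    # so no softmax normalization over the whole batch is required.
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n
```

Because every image-text pair is scored independently, the loss needs no global normalization across the batch (or across devices), which is exactly the delta from CLIP's softmax contrastive loss that makes large batches cheap.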

> Been trying to use it out-of-the-box but was facing those issues. Have been caught up for a bit trying to experiment but no luck. It seems there would have to be some modifications done to the existing CLIP example to get it to work.

Can you elaborate on what exactly you tried and what didn't work? I haven't read the transformers implementation in full, and maybe I'm wrong, but CLIP and SigLIP should be almost identical, with small changes along the lines of what I described?

cc @awni @suvadityamuk

awni commented 3 weeks ago

I haven't read the paper so I can't confirm. But if it is indeed a small delta from CLIP, then it could make sense to support it in the CLIP example. If not, it's probably not worth including in MLX examples. It would be great to have something like an mlx-clip package, maintained by the community, which can load all of those CLIP-style models!

sujantkumarkv commented 3 weeks ago

Okay, will look into this.