mlfoundations / open_clip

An open source implementation of CLIP.

Conceptual help on SigLIP + pre-trained CLIP #889

Closed · miguelalba96 closed this 1 month ago

miguelalba96 commented 1 month ago

How sensible is it to take a pre-trained CLIP checkpoint and fine-tune it with the SigLIP loss?

I have millions of image-text pairs in the fashion domain, where each image can exhibit multiple attributes such as colors, fabrics, and embellishments. The captions are very sparse, i.e. for some dresses I might only have "white dress", while for others I have "white dress with back zipper and buttons and sequins".

To do zero-shot classification I took FashionCLIP and fine-tuned it with LoRA at an effective batch size of 12,000. After training I set up a hierarchical softmax scheme for the "multi-label" classification, so color logits are only compared against other colors, fabrics against fabrics, and so on (see the sketch below). This works reasonably well, except in groups like closures (5 categories): a jacket can have a front zipper and front buttons at the same time, so the softmax hurts those cases a lot.
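For reference, this is roughly what I mean by the per-group softmax over the zero-shot logits (the group names and index ranges here are made up, just to illustrate the idea):

```python
import torch

# Illustrative attribute groups: slices into the text-prompt logits, one per group.
# The names and sizes are placeholders, not my real label set.
ATTRIBUTE_GROUPS = {
    "color":   slice(0, 12),   # "red dress", "white dress", ...
    "fabric":  slice(12, 20),
    "closure": slice(20, 25),  # zipper, buttons, ...
}

def grouped_softmax(logits: torch.Tensor) -> dict:
    """Softmax applied independently within each attribute group.

    logits: (batch, num_prompts) image-text similarities from the CLIP model.
    Colors only compete with colors, fabrics with fabrics, etc.
    """
    return {name: logits[:, idx].softmax(dim=-1)
            for name, idx in ATTRIBUTE_GROUPS.items()}

# probs = grouped_softmax(logit_scale * image_features @ text_features.T)
```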

SigLIP uses a sigmoid-based loss instead, which seems better suited to the cases I am facing. So I tried fine-tuning again with LoRA on the SigLIP base-224 checkpoint, but I noticed the model produces very low logits/probabilities when the captions are very simple, e.g. "red dress". I am now wondering whether it makes sense to take FashionCLIP and fine-tune it with SigLIP's sigmoid loss instead, on the assumption that its initial logits are higher than those of the pre-trained SigLIP. Does that make sense?
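For context, this is the kind of sigmoid (SigLIP-style) pairwise loss I have in mind, written as a single-device sketch over already-normalized features (the `logit_scale` / `logit_bias` names are only illustrative; open_clip ships its own implementation):

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(image_features, text_features, logit_scale, logit_bias):
    """SigLIP-style pairwise sigmoid loss (single device, no chunking).

    Each image-text pair gets an independent binary label: +1 on the
    diagonal (matching pair), -1 everywhere else. There is no softmax
    across the batch, so each pair contributes its own term.
    """
    logits = logit_scale * image_features @ text_features.T + logit_bias
    n = logits.shape[0]
    labels = 2 * torch.eye(n, device=logits.device) - 1  # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * logits).sum() / n
```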

Also, is the sigmoid loss less susceptible to "class collisions", i.e. cases where the same batch contains similar or identical captions, such as multiple dresses with the same caption?