Amount of data needed to fine-tune CLIP and train the Combiner network from scratch

Hi, I'm facing a composed image retrieval challenge over a large (~4M) image dataset. Its distribution differs from the data CLIP was trained on (it covers a specific domain), so the first step presented in your paper is needed even more.

There is no public dataset for my data (specific tech gadgets), so I need to generate one. That is possible, but expensive.

Approximately how much data do you think is needed? Should it be in the triplet format presented in your paper (image, relative prompt, target image)?
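
To make the question concrete, here is a minimal sketch of how I would store one such triplet. The class name, field names, file paths, and the sanity check are my own assumptions for illustration, not anything taken from the paper or the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One annotated example (hypothetical field names)."""
    reference_image: str   # path to the reference (query) image
    relative_prompt: str   # text describing how the target differs
    target_image: str      # path to the image that should be retrieved


def is_valid(triplet: Triplet) -> bool:
    """Basic sanity check: non-empty prompt and two distinct images."""
    has_prompt = bool(triplet.relative_prompt.strip())
    distinct_images = triplet.reference_image != triplet.target_image
    return has_prompt and distinct_images


# Made-up example record; the paths do not refer to real files.
example = Triplet(
    reference_image="images/ref_0001.jpg",
    relative_prompt="same device but in black with a larger screen",
    target_image="images/tgt_0001.jpg",
)
```

Each annotated example would then be one `Triplet`, so the cost of the dataset scales with the number of such records I have to produce.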

Thank you!

miccunifi / SEARLE
