miccunifi / SEARLE

[ICCV 2023] - Zero-shot Composed Image Retrieval with Textual Inversion

Amount of data needed to fine tune CLIP and train the Combiner network from scratch #5

Closed by NoamSC 5 months ago

NoamSC commented 1 year ago

Hi, I'm facing a composed image retrieval challenge over a large (~4M) image dataset. Its distribution is different from the data CLIP was trained on (it covers a specific domain), so the first step presented in your paper is needed even more.

There is no public dataset for my data (specific tech gadgets), so I need to generate one. That is possible, but expensive.

Approximately how much data do you think is needed? Should it be in the triplet format presented in your paper (image, relative prompt, target image)?

Thank you!

LorenzoAgnolucci commented 5 months ago

Hi!

I think you are referring to a different paper.