Closed raunak-agarwal closed 1 year ago
Thanks for your suggestion! This looks interesting to me. I think it may be time for me and my colleagues to think about incorporating additional forms of loss into Tevatron.
In terms of development, I think we will survey a collection of interesting losses and add them together in a single PR. We are open to suggestions for other loss functions to include.
AFAIK, CLOOB's latent space seems to align the text and image modalities much better than CLIP's. Below are two plots I saw someone post on EleutherAI's Discord, where they created UMAPs from a small sample of image-text pairs (CLIP on top, CLOOB below).
Let me know if integrating this is in the works. It would be a great addition to the library. I can also ping here if I come across other interesting losses.
One question: do you have any expectation of what this loss will do in a text-only setup?
My expectation is that in two-tower setups we might see better-aligned embeddings. (I don't think this approach is meant for single-tower setups.)
Other than that, it's hard to say beforehand how much of an improvement we can expect.
I see. We will triage this over the weekend.
Hi, thanks for the great work!
Do you think InfoLOOB (formulation here, implementation here) would be a good addition to this library? It seems to outperform InfoNCE in an image-text setting, so I thought it might be worth experimenting with on purely text-based IR tasks.
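For anyone skimming the thread, here is a rough sketch of how the two objectives differ. It is not Tevatron code and not the CLOOB implementation (CLOOB also adds Hopfield retrieval and sums both directions of the loss, both omitted here). The function names, the plain list-of-lists similarity matrix, and the `tau` temperature parameter are all made up for illustration. The only difference shown is that InfoLOOB ("leave one out") drops the positive pair from the denominator, while InfoNCE keeps it.

```python
import math


def _logsumexp(values):
    # Numerically stable log(sum(exp(v))) over a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def info_nce(sim, tau=1.0):
    # sim[i][j] is the similarity between query i and passage j;
    # the positive for query i is assumed to sit at sim[i][i].
    # The denominator includes the positive pair.
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        total += _logsumexp(logits) - logits[i]
    return total / len(sim)


def info_loob(sim, tau=1.0):
    # Same layout as info_nce, but the positive pair is left out of
    # the denominator, so only the negatives are summed.
    # Needs at least two rows so every query has a negative.
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        negatives = [l for j, l in enumerate(logits) if j != i]
        total += _logsumexp(negatives) - logits[i]
    return total / len(sim)


# Hypothetical toy batch: 3 queries, 3 passages, positives on the diagonal.
sim = [
    [0.9, 0.1, 0.2],
    [0.0, 0.8, 0.3],
    [0.1, 0.2, 0.7],
]
print("InfoNCE :", info_nce(sim, tau=0.1))
print("InfoLOOB:", info_loob(sim, tau=0.1))
```

Because its denominator has one fewer term, InfoLOOB is always lower than InfoNCE on the same batch, and unlike InfoNCE it is not bounded below by zero: it keeps rewarding a larger gap between the positive and the negatives instead of saturating once the positive dominates.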