lightly-ai / lightly

A Python library for self-supervised learning on images.
https://docs.lightly.ai/self-supervised-learning/
MIT License

Add multi-modal method(s) #1573

Open trawler0 opened 1 month ago

trawler0 commented 1 month ago

Hello guys, thanks for this amazing repo, it is very useful for me. I wanted to ask whether there is interest in implementing methods like CLIP for image-language pretraining. I understand that this might not be your current focus and that web-scale pretraining might be out of reach. However, the paper https://arxiv.org/abs/2305.08675 shows that one can reach relatively high zero-shot accuracies with effort roughly comparable to ImageNet pretraining.
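
For reference, the core training objective of CLIP is fairly compact. Below is a minimal sketch of the symmetric image-text contrastive loss, assuming the user already has image and text encoders producing embeddings of the same dimension; this is not part of lightly's API, just an illustration of what such a method would add.

```python
import torch
import torch.nn.functional as F


def clip_loss(
    image_embeddings: torch.Tensor,
    text_embeddings: torch.Tensor,
    temperature: float = 0.07,
) -> torch.Tensor:
    """Symmetric cross-entropy over the image-text similarity matrix.

    Both inputs have shape (batch_size, embed_dim); matching image-text
    pairs share the same row index.
    """
    # Normalize so the dot product becomes a cosine similarity.
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Pairwise similarities between all images and all texts in the batch.
    logits = image_embeddings @ text_embeddings.t() / temperature

    # The i-th image matches the i-th text.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image losses.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```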

guarin commented 1 month ago

Hi! Multi-modal is definitely something we would like to incorporate. There are two main components missing for this: data loading for text, and NLP models/tokenizers. For both we still have to decide how to support them. This was quite easy for vision because data loading is fairly standardized and the models are in torchvision. For text the landscape is more diverse, so we'll have to compare the available libraries first. Please let us know if you have any suggestions/input!
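
To make the data-loading question concrete, here is one possible shape for a paired image-text dataset. The tokenizer choice (a Hugging Face `AutoTokenizer` with `bert-base-uncased`) and the sample layout (a list of `(image_path, caption)` pairs) are assumptions for illustration, not a proposal for lightly's final API.

```python
from typing import List, Tuple

import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms
from transformers import AutoTokenizer


class ImageTextDataset(Dataset):
    """Pairs an image transform with caption tokenization."""

    def __init__(self, samples: List[Tuple[str, str]], max_length: int = 77):
        # samples: list of (image_path, caption) pairs.
        self.samples = samples
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.max_length = max_length
        self.transform = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.ToTensor(),
        ])

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int):
        image_path, caption = self.samples[index]
        image = self.transform(Image.open(image_path).convert("RGB"))
        tokens = self.tokenizer(
            caption,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        return (
            image,
            tokens["input_ids"].squeeze(0),
            tokens["attention_mask"].squeeze(0),
        )
```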