mlfoundations / open_clip

An open source implementation of CLIP.

Easier tutorial for beginners! #461

knight4u13 opened this issue 1 year ago

knight4u13 commented 1 year ago

I am new to open_clip. Is there an easier way to quickly start training? I am really confused by the Training part of README.md, especially since it has so many parallel sections, such as testing and development. I also don't know the format of the dataset used in the "Sample single-process running code" section.

mitchellnw commented 1 year ago

This is a great idea, is anyone interested in making one of these?

Are there any specific questions you have?

Datasets can be in csv or webdataset format (see https://github.com/rom1504/img2dataset for a great way to download datasets in webdataset format).
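For the csv case, a minimal sketch of the expected layout: one row per image, with a column for the image path and one for the caption. The column names below ("filepath", "title") and the tab separator match the training script's defaults, but treat them as assumptions — they are configurable via flags like `--csv-img-key`, `--csv-caption-key`, and `--csv-separator`.

```python
import csv
import io

# Two example rows; paths and captions are made up for illustration.
rows = [
    {"filepath": "images/cat_0001.jpg", "title": "a photo of a cat sitting on a sofa"},
    {"filepath": "images/dog_0002.jpg", "title": "a small dog running on grass"},
]

# Write a tab-separated csv with a header row, as the training script expects.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["filepath", "title"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

You would save this to a file and point `--train-data` at it.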

lyprince commented 1 year ago

I've been working on a repo in my spare time to do this (working title: nanoCLIP, inspired by nanoGPT ofc). Happy to upstream it.

Something I've struggled with is "miniaturizing" the task. Ideally, I think a beginner would want something that could train in <= 6 hours (including dataset download time) in a free colab notebook (so it needs to fit on a single T4). I've tried taking a subset of CC3M with high overlap with cifar10 labels, then doing zero-shot classification on cifar10. The results are not super compelling, however.

Do you have any recommendations for smaller tasks? My suspicion with cifar10 is that the image resolution is too low and that the ten classes are not particularly orthogonal.
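For anyone unfamiliar with the zero-shot step being evaluated here, a sketch of the mechanics, with random numpy vectors standing in for real CLIP embeddings: in practice the text embeddings come from encoding a prompt like "a photo of a {class}" for each cifar10 label, and the image embedding from the vision tower.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 10, 512  # 10 cifar classes; 512-d is a typical CLIP embed dim

# Stand-ins for encoded class prompts and one encoded image.
text_emb = rng.standard_normal((num_classes, dim))
image_emb = rng.standard_normal(dim)

# CLIP scores are cosine similarities: normalize, then take dot products.
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
image_emb /= np.linalg.norm(image_emb)

scores = text_emb @ image_emb      # shape (num_classes,)
pred = int(np.argmax(scores))      # predicted class index
print(pred)
```

Zero-shot accuracy is then just the fraction of test images whose argmax matches the label, which is why poorly separated ("non-orthogonal") class prompts hurt.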

HarmanDotpy commented 1 year ago

How large a subset did you take? I can try helping with experiments if required. Also, how large was the CLIP model? If this doesn't work, one option is to just take a pretrained model and finetune it; the pre- vs. post-finetune accuracy on cifar should differ enough to show the effect of finetuning. This would of course not give the experience of pretraining from scratch, but might still be useful.

edit: one option is to first select a "close to cifar" subset of the CC3M dataset. This can be done by converting the CC3M images to embeddings (using any of the open source CLIP models), as well as the CIFAR train set images, then running faiss to get the nearest neighbors of the cifar images in the cc3m corpus. If we take k nearest neighbors per image, we get k*(size of cifar) images as the training set. Hopefully training CLIP on this would improve cifar performance.
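The selection step above can be sketched in a few lines. This uses brute-force numpy instead of faiss (at CC3M scale you'd use something like a flat inner-product faiss index), and random vectors stand in for the CLIP embeddings of the two image sets:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_cc3m, n_cifar, k = 64, 1000, 50, 5  # toy sizes for illustration

# Stand-ins for CLIP embeddings of CC3M images and CIFAR train images.
cc3m = rng.standard_normal((n_cc3m, dim))
cifar = rng.standard_normal((n_cifar, dim))

# Normalize so that inner product == cosine similarity.
cc3m /= np.linalg.norm(cc3m, axis=1, keepdims=True)
cifar /= np.linalg.norm(cifar, axis=1, keepdims=True)

# Top-k CC3M neighbors for each CIFAR image.
sims = cifar @ cc3m.T                    # (n_cifar, n_cc3m)
knn = np.argsort(-sims, axis=1)[:, :k]   # (n_cifar, k) indices into cc3m

# Union of neighbors = the "close to cifar" training subset,
# at most k * n_cifar images (fewer if neighbors overlap).
subset = np.unique(knn)
print(len(subset))
```

With faiss the only change is replacing the `sims`/`argsort` lines with an index build and a `search(cifar, k)` call.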

Du5TCh3N commented 1 year ago

Hi, I'm also new to Open CLIP and found this repo great for testing out image classification with the different models. Now I want to learn how to finetune a model on my own image dataset, but found the tutorial very confusing. Has there been any progress on what was discussed in this thread? Or if you can point me to a tutorial that shows how to set up the dataset and the code I would need to run, it would be greatly appreciated.