mlfoundations / open_clip

An open source implementation of CLIP.

Reproducing B/32 open clip accuracy reported in "LARGE SCALE OPENCLIP" blog post #385

Closed yaoyu-33 closed 1 year ago

yaoyu-33 commented 1 year ago

Hello, I am currently trying to reproduce the B/32 open_clip results reported in https://laion.ai/blog/large-openclip/, but I am having some difficulty reproducing the number.

In the 12B samples seen section, B/32 is reported at 62.9% ImageNet top-1, but I was only able to get 40%-50% zero-shot ImageNet validation top-1 accuracy. One difference is that I am using COYO-700M instead of LAION-400M.

Here are the open_clip args I am using. I train on 14 nodes, so the total global batch size is around 32k:

  --train-num-samples 676045000 \
  --dataset-type webdataset  \
  --batch-size 288 \
  --epochs 18 \
  --precision amp_bfloat16  \
  --workers 8 \
  --model ViT-B-32 \
  --lr 1e-3 \
  --warmup 2000 \
  --local-loss \

I was wondering if I am using the correct set of arguments here. Thanks in advance for any input.
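For reference, here is the back-of-the-envelope global batch size these flags imply, assuming 8 GPUs per node (the per-node GPU count is not listed above):

    # Effective global batch size (assumes 8 GPUs per node; adjust for your cluster).
    nodes = 14
    gpus_per_node = 8       # assumption, not part of the command above
    per_gpu_batch = 288     # open_clip's --batch-size is per GPU
    print(nodes * gpus_per_node * per_gpu_batch)  # 32256, i.e. roughly 32k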

rom1504 commented 1 year ago

Accuracy is dependent on the pretraining dataset.

Do you have any information showing that COYO should be good?

yaoyu-33 commented 1 year ago

Hi, thanks for the reply.

The COYO repo (https://github.com/kakaobrain/coyo-dataset) reported that their dataset yields similar accuracy on the ALIGN model compared with ALIGN 1.8B. They also reported that their filtered version, COYO-Labeled-300M, yields similar accuracy on ViT compared to JFT-300M.

They didn't provide much comparison to LAION though. Do you think this is the main cause, and that there are no issues with my hyperparameters?

rom1504 commented 1 year ago

If you read carefully, they do not report any zero-shot classification numbers. That points to those numbers not being good. But I advise you to open an issue on their repo and ask directly.

rom1504 commented 1 year ago

Asked for you: https://github.com/kakaobrain/coyo-dataset/issues/13

rom1504 commented 1 year ago

I guess you could try using COYO-Labeled-300M, which is a pseudo-supervised dataset like JFT.

It's not comparable to a normal image/text dataset since it's not scalable, but that might be OK for your application.

yaoyu-33 commented 1 year ago

Thanks Romain for all the suggestions and for opening an issue for me! Let's see if the COYO folks have any feedback.

justHungryMan commented 1 year ago

Hi, is this what you're talking about?

rom1504 commented 1 year ago

I did mean COYO-Labeled-300M, for which they report results.

If they don't distribute the dataset then you can only ask them; I don't know more.

You could also use the LAION datasets.

justHungryMan commented 1 year ago

We reported kNN results on ImageNet with coyo-align, which uses COYO-700M, and top-1 classification results on ImageNet with coyo-vit, which uses COYO-Labeled-300M.

We carefully chose the hyperparameters to match the papers as closely as possible.

It would be good to refer to the configuration in the coyo-align GitHub repo if you want to reproduce the ALIGN model.

COYO and LAION are both crawled from Common Crawl, and I think it is unlikely that there will be a dramatic performance difference due to data differences. (The main difference is that LAION provides more data 🤗)

So if you want to reproduce the results reported for LAION's OpenCLIP, it is better to follow the experimental setup there (e.g. LAION-400M).

yaoyu-33 commented 1 year ago

@justHungryMan thanks for explaining. The comment that "COYO and LAION are both crawled from Common Crawl, and it is unlikely that there will be a dramatic performance difference due to data differences" is useful.

yaoyu-33 commented 1 year ago

@rom1504 Hi Romain, quick question here: aside from the dataset difference, could you confirm, for the "12B samples seen, B/32 at 62.9% ImageNet top-1" result, (1) what the batch size per GPU was and (2) whether local loss was applied? Asking because a larger batch size could boost accuracy quite a lot.

rom1504 commented 1 year ago

We reported kNN results on ImageNet with coyo-align

ImageNet kNN and ImageNet zero-shot classification are very different metrics. Do you have the zero-shot classification metrics?

COYO and LAION are both crawled from Common Crawl, and I think it is unlikely that there will be a dramatic performance difference due to data differences.

We have several experiments showing that CLIP filtering is crucial for performance. COYO didn't do any such filtering, unlike LAION. So yes, I do expect a significant difference due to the data used.
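(For clarity on the metric distinction: below is a minimal sketch of zero-shot ImageNet-style classification with the open_clip API, as opposed to kNN retrieval. The checkpoint path and the three example class names are placeholders; a real evaluation uses all 1,000 ImageNet classes and averages over many prompt templates.)

    import torch
    import open_clip

    # Load a trained checkpoint (model name and path are placeholders).
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="/path/to/checkpoint.pt")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    model.eval()

    # Build a zero-shot classifier from class-name prompts.
    classnames = ["tench", "goldfish", "great white shark"]  # ... all 1000 classes in practice
    prompts = [f"a photo of a {c}" for c in classnames]
    with torch.no_grad():
        text_features = model.encode_text(tokenizer(prompts))
        text_features /= text_features.norm(dim=-1, keepdim=True)

    def zero_shot_predict(pil_image):
        """Return the predicted class index for one PIL image."""
        with torch.no_grad():
            image_features = model.encode_image(preprocess(pil_image).unsqueeze(0))
            image_features /= image_features.norm(dim=-1, keepdim=True)
            return (image_features @ text_features.T).argmax(dim=-1).item()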

rom1504 commented 1 year ago

Hi Romain, quick question here: aside from the dataset difference, could you confirm, for the "12B samples seen, B/32 at 62.9% ImageNet top-1" result, (1) what the batch size per GPU was and (2) whether local loss was applied?

That was with a global batch size of 32k. Yes, we used local loss, but that shouldn't have any impact.

A larger batch size seems to improve results by 0.5-1 point (we went up to 90k for some other runs).

justHungryMan commented 1 year ago

Yes @rom1504. We don't have zero-shot classification results on ImageNet. I think @yaoyu-33 got confused with the kNN results :).

Please see the ImageNet-kNN metric description below (from the ALIGN paper): for each image in the validation set of ImageNet, we retrieve its nearest neighbors from the training set with the pre-trained image encoder.
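To make that metric concrete, here is a minimal sketch (cosine similarity over L2-normalized embeddings). It assumes the train/val image features were already extracted with the pre-trained image encoder; the single matmul would need batching for full ImageNet.

    import torch
    import torch.nn.functional as F

    def knn_top1(train_feats, train_labels, val_feats, val_labels, k=1):
        """For each val image, retrieve its k nearest training images by cosine
        similarity and predict by majority vote over their labels."""
        train_feats = F.normalize(train_feats, dim=-1)
        val_feats = F.normalize(val_feats, dim=-1)
        sims = val_feats @ train_feats.T                   # (n_val, n_train) similarities
        nn_idx = sims.topk(k, dim=-1).indices              # k nearest training indices
        preds = train_labels[nn_idx].mode(dim=-1).values   # majority vote over neighbors
        return (preds == val_labels).float().mean().item()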

We have several experiments showing that CLIP filtering is crucial for performance. COYO didn't do any such filtering, unlike LAION. So yes, I do expect a significant difference due to the data used.

And I also agree with this. We did not do CLIP filtering on COYO-700M, but we provide CLIP similarity scores so that users can customize their filtering strategies.
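A minimal sketch of such a user-side filtering pass over the metadata, assuming it is loaded from parquet with a ViT-B/32 CLIP similarity column; the column name and the 0.28 cutoff (the threshold used for LAION-400M) are assumptions to adjust:

    import pandas as pd

    # One metadata shard (path is a placeholder).
    df = pd.read_parquet("coyo-700m/part-00000.parquet")

    SIM_COLUMN = "clip_similarity_vitb32"  # assumed column name, check the release
    THRESHOLD = 0.28                       # LAION-400M-style ViT-B/32 cutoff

    filtered = df[df[SIM_COLUMN] >= THRESHOLD]
    print(f"kept {len(filtered)}/{len(df)} samples ({len(filtered) / len(df):.1%})")
    filtered.to_parquet("coyo-700m-filtered/part-00000.parquet")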

yaoyu-33 commented 1 year ago

Thanks @rom1504 and @justHungryMan for your explanations regarding LAION vs. COYO.

We have several experiments showing that CLIP filtering is crucial for performance. COYO didn't do any such filtering, unlike LAION. So yes, I do expect a significant difference due to the data used.

And I also agree with this. We did not do CLIP filtering on COYO-700M, but we provide CLIP similarity scores so that users can customize their filtering strategies.

This makes a lot of sense to me as well. Thanks again.

yaoyu-33 commented 1 year ago

@rom1504 I tried running global loss vs. local loss side by side, and I do see some differences (~3% ImageNet top-1 gap at 45k steps, shown below). I am not using the gather_with_grad option though.

[screenshot: ImageNet top-1 curves comparing global loss vs. local loss]

yaoyu-33 commented 1 year ago

I found that after using gather_with_grad, local loss and global loss show the same loss curve. Both converge better than without the flag.
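For anyone hitting the same gap, here is a simplified sketch in the spirit of open_clip's ClipLoss (not the exact implementation) of what the two flags do. Without gather_with_grad, the plain all_gather returns detached copies, so the cross-rank terms of the contrastive loss carry no gradient; with it, the autograd graph is kept across ranks and the local loss matches the full global loss:

    import torch
    import torch.nn.functional as F
    import torch.distributed as dist
    import torch.distributed.nn  # autograd-aware all_gather

    def clip_loss(image_features, text_features, logit_scale,
                  local_loss=True, gather_with_grad=True):
        rank, world = dist.get_rank(), dist.get_world_size()

        if gather_with_grad:
            # Gathered features keep their autograd graph across ranks.
            all_image = torch.cat(torch.distributed.nn.all_gather(image_features))
            all_text = torch.cat(torch.distributed.nn.all_gather(text_features))
        else:
            # Plain all_gather returns detached copies of the other ranks' features.
            img_list = [torch.zeros_like(image_features) for _ in range(world)]
            txt_list = [torch.zeros_like(text_features) for _ in range(world)]
            dist.all_gather(img_list, image_features)
            dist.all_gather(txt_list, text_features)
            if not local_loss:
                # Re-insert the local tensors so they still receive gradients.
                img_list[rank], txt_list[rank] = image_features, text_features
            all_image, all_text = torch.cat(img_list), torch.cat(txt_list)

        n_local = image_features.shape[0]
        if local_loss:
            # Local batch vs. global batch: N_local x N_global logits on each rank.
            logits_per_image = logit_scale * image_features @ all_text.T
            logits_per_text = logit_scale * text_features @ all_image.T
            labels = torch.arange(n_local, device=image_features.device) + rank * n_local
        else:
            # Full global batch on every rank: N_global x N_global logits.
            logits_per_image = logit_scale * all_image @ all_text.T
            logits_per_text = logits_per_image.T
            labels = torch.arange(all_image.shape[0], device=all_image.device)

        return (F.cross_entropy(logits_per_image, labels) +
                F.cross_entropy(logits_per_text, labels)) / 2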

mitchellnw commented 1 year ago

Great to hear; it's expected that local_loss + gather_with_grad = global.

Will close this for now, thanks!