keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License

How do I know it is the pretraining that works, rather than longer finetuning epochs? #40

Closed. rayleizhu closed this issue 1 year ago

rayleizhu commented 1 year ago

I notice that you set the finetuning epochs to 200 or 400.

https://github.com/keyu-tian/SparK/blob/a64bdf729f9491b75d94163982750911e1f91234/downstream_imagenet/arg.py#L16

However, did you try a 100-epoch schedule? Could you also kindly share the results under such a setting?
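
For context, the linked arg.py line is where those schedules are defined. A minimal sketch of what such a per-model mapping could look like (the names, values, and layout below are illustrative assumptions, not the actual SparK code):

```python
# Illustrative sketch only; see downstream_imagenet/arg.py in the repo for the
# real defaults. Smaller backbones get the 200-epoch schedule, larger ones 400.
FINETUNE_EPOCHS = {
    'resnet50':       200,   # model names here are placeholders
    'convnext_small': 200,
    'convnext_base':  400,
    'convnext_large': 400,
}

def get_finetune_epochs(model_name: str, default: int = 200) -> int:
    """Look up the finetuning schedule for a backbone, falling back to 200 epochs."""
    return FINETUNE_EPOCHS.get(model_name, default)
```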

keyu-tian commented 1 year ago

We basically follow A2-MIM's 300-epoch finetuning setting (i.e., the ResNet Strikes Back / RSB A2 recipe), and set 200/400 epochs for smaller/larger models respectively. We exclude the 100-epoch RSB A3 setting since it uses a different resolution (160), but if it is of interest we can have a try.

BTW, ConvNeXt V2 uses 400 or 600 epochs for their smaller models.
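
For readers unfamiliar with the RSB recipes mentioned above, the two settings differ roughly as follows (only the fields relevant to this thread; a shorthand summary rather than the full recipes):

```python
# Shorthand summary of the two ResNet Strikes Back (RSB) schedules discussed
# above: A2 is the 300-epoch recipe followed here, A3 is the shorter one that
# trains at a lower input resolution, which is why it was excluded.
RSB_RECIPES = {
    'A2': {'epochs': 300, 'train_resolution': 224},
    'A3': {'epochs': 100, 'train_resolution': 160},
}
```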

rayleizhu commented 1 year ago

Thanks for your quick response.

> We exclude the 100-epoch RSB A3 setting since it uses a different resolution (160), but if it is of interest we can have a try.

I think the 100-epoch setting is important; otherwise, it is difficult for follow-up works to compare fairly with existing works (SparK, ConvNeXt V2, etc.) because of inconsistent evaluation protocols.

Besides, I think it is more reasonable to finetune pretrained models for no more than 300 epochs, which is what the supervised baseline uses. Otherwise, it is hard to say whether the performance gain comes from longer finetuning or from the better initialization provided by MIM.

keyu-tian commented 1 year ago

I see. But I would suggest not focusing too much on ImageNet finetuning. I feel the best way to justify whether MIM makes sense is to evaluate it on REAL downstream tasks (i.e., not on ImageNet), because doing pretraining and finetuning on the same dataset can be kind of like a "data leakage", and doesn't match our eventual goals of self-supervised learning.

On real downstream tasks (COCO object detection & instance segmentation), SparK can outperform Swin+MIM, Swin+Supervised, Conv+Supervised, and Conv+Contrastive Learning, so these serve as solid proof of SparK's effectiveness.
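
As a concrete illustration of this kind of transfer, here is a generic sketch (not SparK's own detection code; the checkpoint filename and key layout are assumptions) of loading a pretrained ResNet-50 into a torchvision Mask R-CNN backbone before COCO finetuning:

```python
import torch
import torchvision

# Build a COCO-style Mask R-CNN with an untrained ResNet-50 FPN backbone.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=None, weights_backbone=None)

# Load the (hypothetical) self-supervised checkpoint; key layout varies between
# repos, so unwrap a possible 'state_dict'/'module' nesting before loading.
ckpt = torch.load('spark_resnet50.pth', map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt.get('module', ckpt))

# model.backbone.body is the ResNet trunk; strict=False tolerates head/fc keys
# that a backbone-only checkpoint will not contain.
result = model.backbone.body.load_state_dict(state_dict, strict=False)
print('missing keys:', len(result.missing_keys), '| unexpected keys:', len(result.unexpected_keys))
```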

rayleizhu commented 1 year ago

> I see. But I would suggest not focusing too much on ImageNet finetuning. I feel the best way to justify whether MIM makes sense is to evaluate it on REAL downstream tasks (i.e., not on ImageNet), because doing pretraining and finetuning on the same dataset can be kind of like a "data leakage", and doesn't match our eventual goals of self-supervised learning.

This makes sense to me. Thanks for the explanation.

ds2268 commented 11 months ago

> We basically follow A2-MIM's 300-epoch finetuning setting (i.e., the ResNet Strikes Back / RSB A2 recipe), and set 200/400 epochs for smaller/larger models respectively. We exclude the 100-epoch RSB A3 setting since it uses a different resolution (160), but if it is of interest we can have a try.
>
> BTW, ConvNeXt V2 uses 400 or 600 epochs for their smaller models.

But for the B and H/L models, they use only 50- and 100-epoch fine-tuning schedules (ConvNeXt V2 paper, A.1, Table 11). It would be nice to compare apples to apples in terms of fine-tuning epochs. What are the results after 50 epochs of SparK fine-tuning for ConvNeXt-B and 100 epochs for ConvNeXt-H?