sail-sg / poolformer

PoolFormer: MetaFormer Is Actually What You Need for Vision (CVPR 2022 Oral)
https://arxiv.org/abs/2111.11418
Apache License 2.0

On the use of Apex AMP and hybrid stages #22

Open DonkeyShot21 opened 2 years ago

DonkeyShot21 commented 2 years ago

Is there a specific reason why you used Apex AMP instead of the native AMP provided by PyTorch? Have you tried native AMP?

I tried to train poolformer_s12 and poolformer_s24 with solo-learn; with native fp16 the loss goes to nan after a few epochs, while with fp32 it works fine. Did you experience similar behavior?

On a side note, can you provide the implementation and the hyperparameters for the hybrid stage [Pool, Pool, Attention, Attention]? It seems very interesting!

yuweihao commented 2 years ago

Hi @DonkeyShot21, thanks for your attention.

We have only trained poolformer_s12 with Apex AMP, and it works well, so we use Apex AMP in the released code to show how to train the model on GPUs. We have not tested it with native AMP, so we have no experience with it.
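For reference, a minimal Apex AMP setup looks like the sketch below. This is an assumed example, not the repo's actual training script; the model, optimizer, and `opt_level` are placeholders.

```python
# Minimal Apex AMP sketch (assumed setup, not the repo's exact training script).
import torch
from apex import amp

model = torch.nn.Linear(10, 10).cuda()          # stand-in for poolformer_s12
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# "O1" mixed precision; the optimization level used in practice may differ.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(8, 10).cuda()
loss = model(x).pow(2).mean()

# Apex scales the loss before backward to avoid fp16 underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```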

We plan to release the implementation and more trained models with hybrid stages around March. As for the [Pool, Pool, Attention, Attention]-S12 (81.0% accuracy) shown in the paper, we trained it with LayerNorm, a batch size of 1024, and a learning rate of 1e-3. The remaining hyper-parameters are the same as for poolformer_s12. The implementation of the pooling token mixer is the same as in PoolFormer. After the first two stages, the position embedding is added. The attention token mixer is similar to the one in timm; the difference is that, since the default data format of our implementation is [B, C, H, W], the input of the attention token mixer is transformed into [B, N, C], and the output is transformed back into [B, C, H, W].
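Based on this description, a minimal sketch of such an attention token mixer could look as follows. The module name and arguments are assumptions, not the official hybrid-stage code; the point is only the [B, C, H, W] ↔ [B, N, C] reshaping around timm-style attention.

```python
import torch
import torch.nn as nn

class AttentionTokenMixer(nn.Module):
    """Hypothetical attention token mixer that accepts [B, C, H, W] input,
    following the description above (not the official hybrid-stage code)."""
    def __init__(self, dim, num_heads=8, qkv_bias=False):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, C, H, W = x.shape
        # [B, C, H, W] -> [B, N, C] so standard (timm-style) attention applies
        x = x.flatten(2).transpose(1, 2)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        # [B, N, C] -> [B, C, H, W] to match the rest of the network
        return x.transpose(1, 2).reshape(B, C, H, W)
```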

DonkeyShot21 commented 2 years ago

Hi @yuweihao, thanks for the nice reply.

Apex can be hard to install without sudo, which is why I prefer native AMP. Actually, I have tried both (Apex and native) with solo-learn, and both lead to NaNs in the loss quite quickly. This also happens with Swin and ViT. I am trying your implementation now with native AMP and it seems to work nicely; the logs are similar to the ones you posted on Google Drive. So I guess my problem is related to the SSL methods, or to the fact that solo-learn does not support mixup and cutmix. The only way I could stabilize training was with SGD + LARS and gradient accumulation (to simulate a large batch size), but the results are very bad, much worse than ResNet18. I guess SGD is not a good match for metaformers in general.
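For comparison with the Apex snippet above, a minimal native PyTorch AMP loop looks roughly like this. Again an assumed sketch, not the actual solo-learn or poolformer training code.

```python
# Minimal native PyTorch AMP sketch (assumed setup).
import torch

model = torch.nn.Linear(10, 10).cuda()          # stand-in for poolformer_s12
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 10).cuda()
optimizer.zero_grad()

# Forward pass runs in mixed precision under autocast.
with torch.cuda.amp.autocast():
    loss = model(x).pow(2).mean()

# GradScaler scales the loss, unscales before the step, and skips steps
# whose gradients contain inf/NaN.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```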

Thanks for the details on the hybrid stage. I have also seen in other issues that you mentioned depthwise convs can be used instead of pooling, with a slight increase in performance. Do you think this can be paired with the hybrid stages as well (e.g. depthwise conv in the first 2 blocks and then attention in the last 2)?

yuweihao commented 2 years ago

Hi @DonkeyShot21, thanks for your wonderful work for the research community :)

Yes, [DWConv, DWConv, Attention, Attention] also works very well, and it is in our release plan for the hybrid-stage models.
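For illustration, a depthwise-conv token mixer along these lines could be as simple as the sketch below; this is a hypothetical module, not the released one, and the kernel size is an assumption.

```python
import torch.nn as nn

class DWConvTokenMixer(nn.Module):
    """Hypothetical depthwise-conv token mixer for the early stages,
    operating on [B, C, H, W] (not the official released module)."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # groups=dim makes the convolution depthwise: each channel mixes
        # spatial tokens independently, playing the same role as pooling
        # in vanilla PoolFormer.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):
        return self.dwconv(x)
```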

DonkeyShot21 commented 2 years ago

Thank you again! Looking forward to the release!

DonkeyShot21 commented 2 years ago

Hey @yuweihao, sorry to bother you again. For the hybrid stage [Pool, Pool, Attention, Attention], did you use layer norm just for the attention blocks or for the pooling blocks as well? I am trying to reproduce it on ImageNet-100, but I am not getting better performance than vanilla PoolFormer. The params and FLOPs match what you reported, so I guess the implementation should be correct.

yuweihao commented 2 years ago

Hi @DonkeyShot21, I use layer norm for all blocks of [Pool, Pool, Attention, Attention]-S12. I guess the attention blocks may easily overfit on small datasets, which would result in worse performance than vanilla PoolFormer on ImageNet-100.