sail-sg / metaformer

MetaFormer Baselines for Vision (TPAMI 2024)
https://arxiv.org/abs/2210.13452
Apache License 2.0

More gpu hours in training. #10

Closed hkzhang-git closed 1 year ago

hkzhang-git commented 1 year ago

Hi authors, this is impressive work.

I have run experiments training CAFormer-S18 with both the ConvNeXt training code and the official training code provided in your repository. Surprisingly, training with the ConvNeXt code took only about 50 hours, while training with the code from your repository required approximately 150 hours. I'm curious what caused such a significant difference.

For my experiments, I used 8 RTX3090 GPUs, each with 24GB of memory.

Furthermore, I noticed that CAFormer no longer includes position embeddings. I'm wondering whether this change could harm the model's performance.

yuweihao commented 1 year ago

Hi @hkzhang91 ,

Many thanks for your attention. Sorry for the late response; I have been very busy for the past few days.

Regarding speed, I benchmarked the two models on an A100 and also observed that CAFormer-S18 is slower than ConvNeXt-T:

| Model | Train TP (cf) | Infer. TP (cf) | Train TP (cl) | Infer. TP (cl) |
| --- | --- | --- | --- | --- |
| ConvNeXt-T | 575 | 1903 | 495 | 2413 |
| CAFormer-S18 | 349 | 1511 | 361 | 1602 |

PS: TP means throughput, cf means the channels-first memory layout, and cl denotes channels-last.
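
For context, throughput numbers like these are usually measured as images per second with warm-up iterations and explicit CUDA synchronization. Below is a minimal sketch of such a measurement; the batch size, iteration counts, and the `throughput` helper name are my own assumptions, not the benchmark script actually used here:

```python
import time
import torch

def throughput(model, batch_size=128, img_size=224, channels_last=False, n_iters=50):
    """Rough inference throughput in images/s (sketch, not the authors' script)."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device='cuda')
    if channels_last:
        # Switch both weights and activations to the channels-last layout.
        model = model.to(memory_format=torch.channels_last)
        x = x.to(memory_format=torch.channels_last)
    with torch.no_grad():
        for _ in range(10):  # warm-up so CUDA kernels are compiled/cached
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_iters):
            model(x)
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    return n_iters * batch_size / (time.time() - start)
```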

I guess the reason is the implementation. The implementation in this repo prioritizes elegance so that different models can be built easily, and some parts are not efficient; for example, StarReLU is not optimized with a fused CUDA kernel. Timm implements some models similar to CAFormer, like CoAtNet and MaxViT, so maybe you can also learn some efficient implementations from it.
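
For reference, the paper defines StarReLU as `s * relu(x) ** 2 + b` with learnable scalars `s` and `b`. A minimal PyTorch sketch follows; the initial values come from the paper's unit-variance derivation for Gaussian input, and the repo's actual defaults may differ:

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """StarReLU: s * relu(x) ** 2 + b, with learnable scalar scale and bias."""
    def __init__(self, scale_value=0.8944, bias_value=-0.4472):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(scale_value))
        self.bias = nn.Parameter(torch.tensor(bias_value))

    def forward(self, x):
        # Without a fused kernel this launches several elementwise CUDA ops
        # (relu, square, mul, add), each paying full memory traffic -- the
        # inefficiency mentioned above.
        return self.scale * torch.relu(x) ** 2 + self.bias
```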

For position embeddings in CAFormer, I once tried adding them before stage 3 but could not see any improvement. The reason is that the first two stages use convolution as the token mixer, so each patch already "knows" which patches are near it.
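
If anyone wants to reproduce that ablation, a hypothetical wrapper that adds a learnable position embedding to the feature map entering a stage might look like the sketch below; the `(B, H, W, C)` layout and the 14x14 stage-3 resolution for 224x224 input are assumptions on my side:

```python
import torch
import torch.nn as nn

class StageWithPosEmbed(nn.Module):
    """Hypothetical wrapper: add a learnable position embedding before a stage."""
    def __init__(self, stage, dim, h=14, w=14):
        super().__init__()
        self.stage = stage
        self.pos_embed = nn.Parameter(torch.zeros(1, h, w, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, x):  # x: (B, H, W, C), broadcast-added per position
        return self.stage(x + self.pos_embed)
```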

hkzhang-git commented 1 year ago

Thanks for your reply.