Closed by hkzhang-git 1 year ago
Hi @hkzhang91 ,
Many thanks for your attention. Sorry for the late response; I have been very busy over the past several days.
For the speed, I benchmarked the two models on an A100 and also noticed that CAFormer-S18 is slower than ConvNeXt-T there.
| Model | Train TP (cf) | Infer. TP (cf) | Train TP (cl) | Infer. TP (cl) |
|---|---|---|---|---|
| ConvNeXt-T | 575 | 1903 | 495 | 2413 |
| CAFormer-S18 | 349 | 1511 | 361 | 1602 |
PS: TP means throughput, cf means channels-first memory layout, and cl denotes channels-last.
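For reference, a throughput comparison like the one above can be reproduced with a simple timing loop. This is a minimal sketch, not the exact benchmarking script used for the table; the function name and defaults are my own.

```python
import time
import torch

def throughput(model, batch_size=128, img_size=224, channels_last=False, n_iters=30):
    """Rough inference-throughput estimate (images/s) for an image model.

    A sketch only: real benchmarks should fix clocks, use larger n_iters,
    and average over several runs.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device=device)
    if channels_last:
        # Switch both weights and activations to channels-last memory layout.
        model = model.to(memory_format=torch.channels_last)
        x = x.to(memory_format=torch.channels_last)
    with torch.no_grad():
        for _ in range(5):  # warmup iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for all kernels before stopping the clock
    return n_iters * batch_size / (time.time() - start)
```

Comparing `channels_last=True` against the default layout on the same model should reproduce the cf/cl gap shown in the table.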
I guess the reason is the implementation. The implementation in this repo prioritizes elegance, so that different models can be built from the same components, and some parts are not efficient. For example, StarReLU is not optimized in CUDA. Timm implements some models similar to CAFormer, like CoAtNet and MaxViT; maybe you can learn some efficient implementation tricks from it.
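To illustrate why an unfused activation costs throughput: StarReLU computes `s * relu(x)**2 + b` with learnable scalar scale and bias, and in plain PyTorch each of those element-wise ops launches its own CUDA kernel. A minimal sketch (init values here are placeholders, not necessarily the repo's defaults):

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """StarReLU: s * relu(x)**2 + b with learnable scalar scale and bias.

    Plain-PyTorch sketch: the relu, square, multiply, and add each run as a
    separate kernel, which is the overhead a fused CUDA op would remove.
    """
    def __init__(self, scale_init=1.0, bias_init=0.0):
        super().__init__()
        self.relu = nn.ReLU()
        self.scale = nn.Parameter(torch.tensor(scale_init))
        self.bias = nn.Parameter(torch.tensor(bias_init))

    def forward(self, x):
        return self.scale * self.relu(x) ** 2 + self.bias

x = torch.tensor([-1.0, 0.0, 2.0])
print(StarReLU()(x).detach())  # -> tensor([0., 0., 4.])
```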
For position embeddings in CAFormer, I tried adding position embeddings before stage 3 but saw no improvement. The reason is that the first two stages use convolution as the token mixer, so each patch already "knows" which patches are near it.
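For concreteness, the variant I tried could be sketched as a learnable absolute position embedding added to the feature map entering stage 3. The module name and shapes below are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn

class PosEmbedBeforeStage3(nn.Module):
    """Hypothetical sketch: a learnable absolute position embedding added to
    the (B, C, H, W) feature map before stage 3 of a 4-stage backbone."""
    def __init__(self, dim, h, w):
        super().__init__()
        # One embedding per spatial location, broadcast over the batch.
        self.pos_embed = nn.Parameter(torch.zeros(1, dim, h, w))

    def forward(self, x):
        return x + self.pos_embed

# For a 224x224 input, stage 3 typically sees a 14x14 feature map.
m = PosEmbedBeforeStage3(dim=320, h=14, w=14)
y = m(torch.randn(2, 320, 14, 14))
```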
Thanks for your reply.
Hi, authors, this is an impressive work.
I have conducted experiments training CAFormer-S18 with both the training code provided in ConvNeXt and the official training code provided in your repository. Surprisingly, training with the ConvNeXt code took only about 50 hours, while training with the code in your repository required approximately 150 hours. I'm curious what caused such a significant difference.
For my experiments, I used 8 RTX3090 GPUs, each with 24GB of memory.
Furthermore, I noticed that CAFormer no longer includes position embeddings. I'm wondering if this change could potentially harm the model's performance.