Closed: yyNoBug closed this issue 1 month ago

Dear authors, thanks for your great work! I tried training a DiM model with your provided config configs/imagenet256_H_DiM.py, but it triggers a CUDA out-of-memory error. I am running your original code unmodified on 8 A100 GPUs. Do you have any idea what might be the issue?
Have you installed DeepSpeed and enabled ZeRO-2?
pip install deepspeed
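For reference, here is a minimal sketch of hooking a model into DeepSpeed with ZeRO stage 2. The placeholder model, optimizer settings, and batch split below are illustrative assumptions, not our exact training setup:

```python
import torch
import deepspeed

# Placeholder model for the sketch; substitute the DiM model built from the repo config.
model = torch.nn.Linear(1024, 1024)

# Illustrative ZeRO-2 config. Stage 2 shards optimizer states and gradients
# across GPUs, which is what relieves the memory pressure here.
ds_config = {
    "train_micro_batch_size_per_gpu": 96,  # 96 x 8 GPUs = 768 global batch (assumed split)
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},  # assumed optimizer settings
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},  # mixed precision; an assumption, not necessarily the repo default
}

# Run under the DeepSpeed launcher, e.g. `deepspeed train_sketch.py`.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```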
Thanks! A batch size of 768 works fine after I installed DeepSpeed.
By the way, the training now takes 3 seconds per step. Is that normal?
Yes, 3 s/iter is normal. It takes us more than 20 days to train for 625K iterations.
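That is simple arithmetic from the numbers above: $625{,}000 \times 3\,\mathrm{s} = 1.875 \times 10^{6}\,\mathrm{s} \approx 21.7$ days of wall-clock time.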
As for speed on $256 \times 256$ images, Mamba is slower than the Transformer because it performs double the number of scans. See Figure 3 in our paper for the details on speed.
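On the doubled scans: many Mamba-style vision backbones make the causal scan bidirectional by also scanning the reversed token sequence, so each layer does two scans where attention does a single pass. A toy sketch of that pattern, assumed for illustration and not the exact DiM scan order:

```python
import torch

def bidirectional_scan(x: torch.Tensor, scan_fn) -> torch.Tensor:
    """Toy bidirectional scan: run a causal scan forward and backward over
    the sequence axis and fuse the results, i.e. two scans per layer."""
    fwd = scan_fn(x)                  # forward scan over the token sequence
    bwd = scan_fn(x.flip(1)).flip(1)  # reverse tokens, scan, restore order
    return fwd + bwd

# Stand-in "scan": a causal cumulative mean over the sequence dimension.
x = torch.randn(2, 16, 64)  # (batch, seq_len, dim)
causal_mean = lambda t: t.cumsum(dim=1) / torch.arange(
    1, t.size(1) + 1, device=t.device, dtype=t.dtype).view(1, -1, 1)
print(bidirectional_scan(x, causal_mean).shape)  # torch.Size([2, 16, 64])
```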