About 9 days.
Mamba is slower than the Transformer when the input sequence is short; we analyzed this speed comparison in Figure 3 of our paper.
Other papers, such as [1], report the same phenomenon.
[1] Yang S, Wang B, Shen Y, et al. Gated Linear Attention Transformers with Hardware-Efficient Training. arXiv preprint arXiv:2312.06635, 2023.
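For anyone who wants to check this trend on their own hardware, here is a minimal timing sketch (not the benchmark from Figure 3 of the paper): it compares a single fused scaled-dot-product attention call against a single Mamba block at a few sequence lengths. It assumes a CUDA GPU, PyTorch >= 2.0, and the `mamba_ssm` package; the batch size, model width, and sequence lengths are illustrative choices, not the paper's settings. The expected pattern is that attention is faster at short lengths because of its low per-call overhead, while the linear-time scan catches up and wins as the sequence grows.

```python
# Rough timing sketch (illustrative settings, not the paper's benchmark).
# Assumes a CUDA GPU, PyTorch >= 2.0, and `pip install mamba-ssm`.
import time
import torch
import torch.nn.functional as F
from mamba_ssm import Mamba  # assumed available; not part of this repo

device, dtype = "cuda", torch.float16
batch, d_model, n_heads = 2, 1024, 16

mamba = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2).to(device, dtype)

def attention(x):
    # One fused attention call; QKV projections omitted to keep the sketch short.
    b, l, d = x.shape
    q = k = v = x.view(b, l, n_heads, d // n_heads).transpose(1, 2)
    return F.scaled_dot_product_attention(q, k, v)

def bench(fn, x, iters=20):
    for _ in range(3):              # warm-up passes
        fn(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3   # ms per forward pass

with torch.no_grad():
    for seq_len in (256, 1024, 4096, 16384):
        x = torch.randn(batch, seq_len, d_model, device=device, dtype=dtype)
        print(f"L={seq_len:6d}  attention {bench(attention, x):7.2f} ms   "
              f"mamba {bench(mamba, x):7.2f} ms")
```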
Hi, excuse me. How long did it take to train the Large model on ImageNet 256×256 with 8×A100 80GB GPUs?