About 9 days.
Mamba is slower than the Transformer when the input sequence is short; we analyzed this speed comparison in Figure 3 of our paper.
Other papers, such as [1], report the same phenomenon.
[1] Yang S, Wang B, Shen Y, et al. Gated Linear Attention Transformers with Hardware-Efficient Training. arXiv preprint arXiv:2312.06635, 2023.
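For anyone who wants to check this trend on their own hardware, here is a minimal timing sketch (not the benchmark from Figure 3 of the paper): it compares a single fused scaled-dot-product attention call against a single Mamba block at a few sequence lengths. It assumes a CUDA GPU, PyTorch >= 2.0, and the `mamba_ssm` package; the batch size, model width, and sequence lengths are illustrative choices, not the paper's settings. The expected pattern is that attention is faster at short lengths because of its low per-call overhead, while the linear-time scan catches up and wins as the sequence grows.

```python
# Rough timing sketch (illustrative settings, not the paper's benchmark).
# Assumes a CUDA GPU, PyTorch >= 2.0, and `pip install mamba-ssm`.
import time
import torch
import torch.nn.functional as F
from mamba_ssm import Mamba  # assumed available; not part of this repo

device, dtype = "cuda", torch.float16
batch, d_model, n_heads = 2, 1024, 16

mamba = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2).to(device, dtype)

def attention(x):
    # One fused attention call; QKV projections omitted to keep the sketch short.
    b, l, d = x.shape
    q = k = v = x.view(b, l, n_heads, d // n_heads).transpose(1, 2)
    return F.scaled_dot_product_attention(q, k, v)

def bench(fn, x, iters=20):
    for _ in range(3):              # warm-up passes
        fn(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3   # ms per forward pass

with torch.no_grad():
    for seq_len in (256, 1024, 4096, 16384):
        x = torch.randn(batch, seq_len, d_model, device=device, dtype=dtype)
        print(f"L={seq_len:6d}  attention {bench(attention, x):7.2f} ms   "
              f"mamba {bench(mamba, x):7.2f} ms")
```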
Hi, excuse me. How long did it take to train the Large model on ImageNet 256×256 with 8×A100 80GB GPUs?