hustvl / Vim

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Apache License 2.0

Much lower training efficiency #54

Open ydhongHIT opened 3 months ago

ydhongHIT commented 3 months ago

Thanks for your great work! However, I observe that the training efficiency (both training speed and memory usage) is much worse than that of a plain ViT with a similar model size. Do you have any insights into this phenomenon?
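For anyone who wants to quantify this on their own setup, here is a minimal sketch (not from this repo) that times a training step and reports peak GPU memory on a fixed in-memory batch, so the data pipeline is excluded from the comparison. The function and argument names are illustrative.

```python
import time
import torch

def benchmark(model, images, targets, n_iters=50, device="cuda"):
    """Rough per-iteration time and peak GPU memory for one training step."""
    model = model.to(device).train()
    images, targets = images.to(device), targets.to(device)
    optimizer = torch.optim.AdamW(model.parameters())
    criterion = torch.nn.CrossEntropyLoss()

    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.time()
    for _ in range(n_iters):
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize(device)

    ms = (time.time() - start) / n_iters * 1000
    gib = torch.cuda.max_memory_allocated(device) / 2**30
    print(f"{ms:.1f} ms/iter, {gib:.2f} GiB peak")
```

Running this once for Vim and once for a DeiT/ViT of comparable size (e.g. with random inputs of shape `(128, 3, 224, 224)`) would show whether the slowdown comes from the model itself or from the surrounding pipeline.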

yrqUni commented 3 months ago

I think the bottleneck lies in the CPU iteratively loading and preprocessing the data, which leads to low efficiency.
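One way to test that hypothesis is to log how long each iteration spends blocked on the loader versus on the actual GPU step. A rough sketch below, where `dataset` and `train_step` are placeholders for your own pipeline rather than anything in this repo:

```python
import time
import torch
from torch.utils.data import DataLoader

# `dataset` and `train_step` are placeholders for your own pipeline.
loader = DataLoader(dataset, batch_size=128, shuffle=True,
                    num_workers=16, pin_memory=True, persistent_workers=True)

end = time.time()
for i, (images, targets) in enumerate(loader):
    data_wait = time.time() - end                # time blocked on CPU-side loading
    images = images.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    train_step(images, targets)                  # forward / backward / optimizer step
    torch.cuda.synchronize()                     # so iter_time includes the GPU work
    iter_time = time.time() - end
    if i % 50 == 0:
        print(f"iter {i}: data {data_wait * 1000:.0f} ms / total {iter_time * 1000:.0f} ms")
    end = time.time()
```

If `data_wait` is a large fraction of the total, the CPU pipeline is the bottleneck; if it is near zero, the slowdown is in the model itself.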

Leopold2333 commented 2 months ago

I have the same issue. Could anyone explain this phenomenon? In which parts of the code does the CPU matter? 😢

zhuqiangLu commented 2 months ago

Have you tried using a different number of workers? A smaller batch size (say, 128 per GPU) with 16 CPU workers looks fairly reasonable to me. I tried training a Vim-tiny on ImageNet-1k with 4x V100 (16 GB) and AMP enabled; it takes around 4 seconds to run 10 iterations and around 17 minutes to finish 1 epoch, with GPU utilization at 100%. Perhaps it has something to do with GPU memory bandwidth.
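For reference, a minimal AMP training step along the lines of the setup described above; this is an illustrative sketch, not the repo's actual training loop:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, images, targets, optimizer, criterion):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():       # run the forward pass in mixed precision
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```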

jsrdcht commented 1 month ago

> Thanks for your great work! However, I observe that the training efficiency (both training speed and memory usage) is much worse than that of a plain ViT with a similar model size. Do you have any insights into this phenomenon?

The code sets the number of blocks to 24 for the small and tiny variants, which is twice the depth of a normal ViT-Small/Tiny, and I really can't understand why.
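A plausible reason, not confirmed by the authors in this thread: a Mamba block has roughly half the parameters of a ViT block at the same width, so doubling the depth keeps the overall model size comparable. A back-of-the-envelope sketch under that assumption (norms, biases, and the small conv/SSM terms are ignored):

```python
# Rough parameter count per block at width d (embed_dim).
def vit_block_params(d, mlp_ratio=4):
    attn = 4 * d * d                 # qkv (3*d*d) + output projection (d*d)
    mlp = 2 * mlp_ratio * d * d      # two linear layers, d -> 4d -> d
    return attn + mlp                # ~12 * d^2

def mamba_block_params(d, expand=2):
    e = expand * d
    in_proj = d * 2 * e              # projects to the SSM branch and the gate
    out_proj = e * d
    return in_proj + out_proj        # ~6 * d^2, plus small conv/SSM terms

d = 192  # tiny width
print(12 * vit_block_params(d))      # 12 ViT blocks
print(24 * mamba_block_params(d))    # 24 Mamba blocks -- roughly the same size
```

Under this rough count, both come out to about 5.3M parameters at width 192, which would explain matching a 12-block ViT with 24 Vim blocks.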