ydhongHIT opened this issue 3 months ago
I think the bottleneck lies in the batch-by-batch data preparation on the CPU, which leads to low efficiency.
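If so, it should show up directly when you time the loader against the forward pass. A minimal sketch of that check (FakeData and vit_b_16 are placeholders for the real ImageNet pipeline and the Vim model; assumes a CUDA device):

```python
import time

import torch
import torchvision
from torch.utils.data import DataLoader

# FakeData is a stand-in; swap in the real ImageNet dataset and transforms.
dataset = torchvision.datasets.FakeData(
    size=10_000, image_size=(3, 224, 224),
    transform=torchvision.transforms.ToTensor(),
)
loader = DataLoader(dataset, batch_size=128, num_workers=16, pin_memory=True)
model = torchvision.models.vit_b_16().cuda().eval()  # any model works for the test

data_time = compute_time = 0.0
t0 = time.perf_counter()
for step, (images, _) in enumerate(loader):
    t1 = time.perf_counter()
    data_time += t1 - t0                      # time spent waiting on the CPU loader
    with torch.no_grad():
        model(images.cuda(non_blocking=True))
    torch.cuda.synchronize()                  # make GPU time visible to the wall clock
    t0 = time.perf_counter()
    compute_time += t0 - t1
    if step == 50:
        break
print(f"data wait: {data_time:.1f}s   GPU compute: {compute_time:.1f}s")
```

If the data-wait total dominates, the CPU loader (not the model) is what's holding the GPUs back.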
I'm hitting the same issue; could anyone explain this phenomenon? In which parts of the code does the CPU matter? 😢
Have you tried using a different number of workers? A smaller batch size (say, 128 for a single GPU) with 16 CPU workers looks fairly reasonable to me. I tried training a Vim-Tiny on ImageNet-1K with 4×V100 (16 GB) and AMP enabled; it takes around 4 seconds to run 10 iterations and around 17 minutes to finish one epoch, with GPU utilization at 100%. Perhaps it also has something to do with GPU bandwidth.
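For reference, a loader configured along those lines might look like this (a sketch, assuming `train_dataset` is whatever ImageNet dataset object the training script builds; the extra flags are standard PyTorch DataLoader options, not something specific to this repo):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    train_dataset,            # hypothetical: the repo's ImageNet dataset object
    batch_size=128,           # per-GPU batch size suggested above
    num_workers=16,           # scale with available CPU cores
    pin_memory=True,          # faster host-to-device copies
    persistent_workers=True,  # avoid re-forking workers every epoch
    prefetch_factor=4,        # each worker keeps a few batches ready (default 2)
    drop_last=True,
)
```

Raising `prefetch_factor` and keeping workers persistent mainly helps when per-sample augmentation is expensive; past the point where the GPUs stay fed, more workers just burn RAM.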
Thanks for your great work! However, I observe that the training efficiency (both training speed and memory use) is much lower than that of a plain ViT with a similar model size. Do you have any insights into this phenomenon?
The code sets the number of blocks to 24 for the small and tiny models, which is twice that of a standard ViT-Small/Tiny, and I really can't understand why.
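For what it's worth, a back-of-the-envelope parameter count suggests one possible reason: a Mamba block has roughly half the parameters of a transformer block at the same width, so 24 Mamba blocks land near the size of a 12-block ViT. A rough sketch (the per-block formulas are my approximations; they ignore norms, biases, the conv, and the small SSM projections):

```python
# Approximate per-block parameter counts at hidden width d.
def vit_block_params(d: int) -> int:
    attn = 4 * d * d         # q, k, v, and output projections
    mlp = 2 * d * (4 * d)    # two linear layers with 4x expansion
    return attn + mlp        # ~12 d^2

def mamba_block_params(d: int, expand: int = 2) -> int:
    d_inner = expand * d
    in_proj = d * 2 * d_inner  # projects to the x and z branches
    out_proj = d_inner * d
    return in_proj + out_proj  # ~6 d^2 with expand=2

d = 192  # tiny width
print(12 * vit_block_params(d))    # ViT-Tiny: 12 blocks  -> ~4.4M
print(24 * mamba_block_params(d))  # 24 Mamba blocks      -> about the same
```

Under these approximations the two totals come out equal (144·d² each), so depth 24 may simply be matching the ViT parameter budget rather than doubling it.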