fzyzcjy opened this issue 3 weeks ago
Andrej Karpathy also discusses this concept in his video "Let's reproduce GPT-2 (124M)".
See 1:27:00, where he discusses the memory-bandwidth bottleneck scenario, and 1:52:00, where he covers the GPU memory architecture.
In theory, if memory bandwidth were slowing down your training process, wouldn't you expect to see (closer to) 100% memory bandwidth utilization plus lower tensor core utilization? I.e., the GPU would complete a batch, then sit around waiting for the next training batch to be transferred?
In your experiments, it seems like the GPU is tied up completing the training batch workloads, and the CPU is waiting to schedule more batch transfers to the GPU.
@jchook Thank you. Yes, I would expect to see ~100% GPU memory bandwidth utilization + much lower CUDA/tensor core utilization. But what I see in my experiments is exactly the opposite. That's why I am confused.
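For reference, the two counters can be polled side by side while the trainer runs; a minimal sketch, assuming the `pynvml` bindings are installed and the training job sits on GPU 0 (note that NVML's memory utilization is the percentage of time the memory controller was busy, not the fraction of peak bandwidth achieved):

```python
# Minimal sketch: poll NVML's utilization counters next to a running training job.
# Assumes the pynvml bindings are installed (pip install nvidia-ml-py) and the
# trainer is on GPU 0. Note: utilization.memory is the percentage of time the
# memory controller was busy over the sampling window, NOT the fraction of
# peak bandwidth actually achieved.
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        # util.gpu    -> % of time at least one kernel was executing ("GPU-Util" in nvidia-smi)
        # util.memory -> % of time device memory was being read or written
        print(f"sm_busy={util.gpu}%  mem_busy={util.memory}%")
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()
```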
Hi, thanks for the library! This is more of a discussion than an issue. It seems that when using Unsloth or the Hugging Face Trainer to fully fine-tune a ~1B model, GPU utilization is >90%, while memory bandwidth is only about half used (30%-70%, depending on the concrete experiment).
I have heard that memory bandwidth is quite important and often limits training speed. For example, in https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/, the author says:
Since my 4090 is faster in FLOPS but has lower memory bandwidth than an A100, I would guess it should be even more severely bound by memory bandwidth. That contradicts what `nvidia-smi` seems to show when training the model. Therefore, I wonder whether we can squeeze out some performance by boosting this value?
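For a back-of-the-envelope sanity check on the 4090-vs-A100 comparison, one can compare each card's FLOP:byte ratio against the (measured or estimated) arithmetic intensity of a training step; a rough sketch, where the peak numbers and the workload intensity are approximate placeholders to be replaced with real values:

```python
# Back-of-the-envelope roofline check: a kernel is roughly compute-bound when
# its arithmetic intensity (FLOPs per byte of memory traffic) exceeds the
# GPU's FLOP:byte ratio ("ridge point"), and memory-bandwidth-bound otherwise.
# All numbers below are approximate spec-sheet placeholders, not measurements.

def ridge_point(peak_flops: float, peak_bandwidth_bytes_per_s: float) -> float:
    """FLOPs the GPU can do per byte it can move, at peak rates."""
    return peak_flops / peak_bandwidth_bytes_per_s

gpus = {
    # name: (approx. peak BF16 tensor-core FLOP/s, approx. peak memory bandwidth in B/s)
    "RTX 4090": (165e12, 1.0e12),
    "A100 80GB": (312e12, 2.0e12),
}

# Hypothetical arithmetic intensity of one training step (FLOPs per byte of
# device-memory traffic). This is a placeholder; it has to be measured or
# estimated for the actual model, dtype, and batch size.
workload_intensity = 100.0

for name, (flops, bw) in gpus.items():
    rp = ridge_point(flops, bw)
    verdict = "compute-bound" if workload_intensity > rp else "memory-bandwidth-bound"
    print(f"{name}: ridge point ~ {rp:.0f} FLOP/byte -> workload would be {verdict}")
```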