unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Why is memory bandwidth only half used? Is it possible we speed up by utilizing this? #1230

Open fzyzcjy opened 3 weeks ago

fzyzcjy commented 3 weeks ago

Hi, thanks for the library! This is more of a discussion than an issue. When using unsloth or the huggingface Trainer to fully finetune a ~1B model, GPU utilization is >90%, while memory bandwidth is only about half used (30%-70%, depending on the concrete experiment).
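For context, a minimal sketch of how such numbers can be sampled during a training run with pynvml; the device index and polling interval are arbitrary choices, and I'm assuming the figures above come from nvidia-smi-style utilization counters:

```python
# Sketch: poll GPU utilization and memory-activity utilization while training
# runs in another process. Assumes pynvml (pip install nvidia-ml-py) and
# device index 0; both are illustrative choices.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        # util.gpu    : % of the sample period in which any kernel was running
        # util.memory : % of the sample period in which device memory was read/written
        print(f"gpu={util.gpu}%  memory={util.memory}%")
        time.sleep(1.0)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```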

I have heard that memory bandwidth is quite important and often bounds training speed. For example, in https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/, the author says:

This means that when comparing two GPUs with Tensor Cores, one of the single best indicators for each GPU’s performance is their memory bandwidth. For example, The A100 GPU has 1,555 GB/s memory bandwidth vs the 900 GB/s of the V100. As such, a basic estimate of speedup of an A100 vs V100 is 1555/900 = 1.73x.

Since my 4090 has higher FLOPS but lower memory bandwidth than an A100, I would expect it to be even more severely bound by memory bandwidth. That contradicts what nvidia-smi appears to show while training the model.
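As a rough roofline-style sanity check of that intuition, one can compare each card's ratio of peak tensor FLOPS to peak memory bandwidth (the "ridge point"). The spec numbers below are approximate datasheet values assumed for illustration, and the comparison depends on which precision/accumulate mode the kernels actually use:

```python
# Roofline "ridge point": the arithmetic intensity (FLOPs per byte moved) at
# which a GPU shifts from memory-bandwidth-bound to compute-bound.
# Spec numbers are approximate datasheet values, assumed for illustration only.
specs = {
    # name: (approx. peak dense tensor TFLOPS, approx. memory bandwidth in GB/s)
    "A100 40GB, BF16":             (312, 1555),
    "RTX 4090, FP16 (FP32 accum)": (165, 1008),
    "RTX 4090, FP16 (FP16 accum)": (330, 1008),
}

for name, (tflops, bw_gbs) in specs.items():
    ridge = (tflops * 1e12) / (bw_gbs * 1e9)  # FLOPs per byte
    print(f"{name}: ridge point ~ {ridge:.0f} FLOPs/byte")

# Kernels whose arithmetic intensity falls below the ridge point are limited by
# memory bandwidth; kernels above it are limited by compute.
```

Which side of the ridge a given kernel falls on depends on its own arithmetic intensity, so a full training step mixes compute-bound and bandwidth-bound phases.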

Therefore, I wonder whether we can squeeze out some extra performance by pushing this utilization higher?
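One way to ground that percentage would be to compare it against the bandwidth the card can actually sustain. A minimal sketch of a device-to-device copy benchmark in PyTorch (tensor size and iteration count are arbitrary choices):

```python
import torch

# Measure achievable device memory bandwidth with a large tensor copy.
n = 1 << 28  # 2^28 float32 elements, about 1 GiB per tensor
x = torch.empty(n, device="cuda", dtype=torch.float32)
y = torch.empty_like(x)

# Warm-up so allocation and context setup don't distort the timing.
for _ in range(3):
    y.copy_(x)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 20
start.record()
for _ in range(iters):
    y.copy_(x)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3          # elapsed_time returns ms
bytes_moved = 2 * x.numel() * x.element_size() * iters  # one read + one write per copy
print(f"~{bytes_moved / seconds / 1e9:.0f} GB/s effective copy bandwidth")
```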

jchook commented 3 weeks ago

Andrej Karpathy also discusses this concept in his video Let's reproduce GPT-2 (124M).

See 1:27:00, where he discusses the memory bandwidth bottleneck scenario, and 1:52:00, where he covers the GPU memory architecture.

In theory, if memory bandwidth were slowing down your training process, wouldn't you expect to see (closer to) 100% memory bandwidth utilization plus lower tensor core utilization? I.e., the GPU would complete a batch, then sit waiting around for another training batch to be transferred?

In your experiments, it seems like the GPU is tied up completing the training batch workloads, and the CPU is waiting to schedule more batch transfers to the GPU.
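One way to check which of those two pictures applies would be to profile a few training steps and see where the CUDA time actually goes. A minimal sketch with torch.profiler (the model and step below are placeholders, not the actual finetuning setup):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and batch; substitute the real training step.
model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters())
x = torch.randn(64, 4096, device="cuda")

def training_step():
    optimizer.zero_grad(set_to_none=True)
    loss = model(x).float().pow(2).mean()
    loss.backward()
    optimizer.step()

# Warm-up so CUDA context and allocator setup don't skew the profile.
for _ in range(5):
    training_step()
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        training_step()
    torch.cuda.synchronize()

# Compute-bound steps are dominated by GEMM kernels; bandwidth-bound steps by
# elementwise/optimizer/copy kernels, with CPU-side gaps if the GPU is starved
# for input batches.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```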

fzyzcjy commented 3 weeks ago

@jchook Thank you. Yes, I would expect to see ~100% GPU memory bandwidth utilization plus much lower CUDA/tensor core utilization. But what I actually see in experiments is exactly the contrary. That's why I am confused.