
Overlap matrix multiplication (needs tensor core) and other things like activation (needs cuda core and memory bandwidth) to speed up #1233

Open fzyzcjy opened 2 weeks ago

fzyzcjy commented 2 weeks ago

Hi, thanks for the library! I have a naive thought: we know a deep learning forward/backward pass cannot be parallelized within itself, because each operation/layer must finish before the next one can start. But what if we compute two batches almost in parallel? Then, for example, while the first batch is running a big matrix multiplication (tensor cores busy, CUDA cores idle, memory bandwidth mostly idle), we could issue CUDA instructions to compute the activation functions for the second batch (tensor cores idle, CUDA cores busy, memory bandwidth fairly busy).

Also discussions here: https://forums.developer.nvidia.com/t/concurrent-execution-of-cuda-and-tensor-cores/222985/33?u=ch271828n

Cross-posted: https://github.com/linkedin/Liger-Kernel/issues/341
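A minimal sketch of the idea in PyTorch (not Unsloth code; the shapes and the GELU activation are illustrative): issue the matmul for one micro-batch and the activation for another on separate CUDA streams, so the hardware scheduler can overlap them when occupancy allows.

```python
# Sketch only: overlap a tensor-core-heavy matmul (micro-batch A) with a
# CUDA-core / bandwidth-heavy activation (micro-batch B) on two streams.
# Whether real overlap happens depends on kernel occupancy and the scheduler.
import torch

device = torch.device("cuda")
x_a = torch.randn(4096, 4096, device=device)
w   = torch.randn(4096, 4096, device=device)
x_b = torch.randn(4096, 4096, device=device)

matmul_stream = torch.cuda.Stream()
act_stream    = torch.cuda.Stream()

# Make sure setup work on the default stream is done before forking.
torch.cuda.synchronize()

with torch.cuda.stream(matmul_stream):
    y_a = x_a @ w                          # tensor cores

with torch.cuda.stream(act_stream):
    y_b = torch.nn.functional.gelu(x_b)    # CUDA cores + memory bandwidth

# Join both streams before the results are consumed.
torch.cuda.current_stream().wait_stream(matmul_stream)
torch.cuda.current_stream().wait_stream(act_stream)
torch.cuda.synchronize()
```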

danielhanchen commented 2 weeks ago

Oh interesting - maybe CUDA Streams is what you're looking for? I.e. you can overlap 2 CUDA operations on 1 GPU. Generally it's not that worth it, but it could work - the main issue is that with Python CUDA streams, the Python overhead might actually make it slower, so some sort of C++ CUDA stream orchestration would probably be needed.
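One hedged way to address the Python launch overhead mentioned above, while staying in Python, is to capture the two-stream schedule into a CUDA graph once and then replay it with a single launch per step. This is only a sketch under the assumption that the step is capture-friendly (static shapes, no CPU-side control flow); it is not something Unsloth does today.

```python
# Sketch: amortise Python launch overhead by capturing the forked
# matmul/activation schedule into a CUDA graph and replaying it.
import torch

device = torch.device("cuda")
x_a = torch.randn(4096, 4096, device=device)
w   = torch.randn(4096, 4096, device=device)
x_b = torch.randn(4096, 4096, device=device)

side_stream = torch.cuda.Stream()

def step():
    # Matmul on the capture stream, activation forked onto a side stream.
    y_a = x_a @ w
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        y_b = torch.nn.functional.gelu(x_b)
    torch.cuda.current_stream().wait_stream(side_stream)
    return y_a, y_b

# Warm up on a non-default stream before capture, as PyTorch recommends.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        step()
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out_a, out_b = step()

# One replay re-runs both kernels with a single call from Python;
# refill x_a / x_b in place before each replay to change the inputs.
g.replay()
```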

fzyzcjy commented 2 weeks ago

@danielhanchen Thank you, that looks interesting! Is there any possibility Unsloth could use this (to speed things up)?