Open fzyzcjy opened 2 weeks ago
Hi, thanks for the library! I have a naive thought: we know a deep learning forward/backward pass cannot be parallelized, because each operation/layer must finish before the next one can start. But what if we compute two batches almost in parallel? Then, for example, while the first batch is running a big matrix multiplication (tensor cores busy, CUDA cores idle, memory bandwidth idle), we could issue some CUDA instructions to compute the second batch's activation functions (tensor cores idle, CUDA cores busy, memory bandwidth somewhat busy).

Also discussed here: https://forums.developer.nvidia.com/t/concurrent-execution-of-cuda-and-tensor-cores/222985/33?u=ch271828n

Cross-posted: https://github.com/linkedin/Liger-Kernel/issues/341
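A minimal sketch of the idea in plain PyTorch (hypothetical shapes, not Unsloth code): launch a tensor-core-bound GEMM for one batch on one CUDA stream while a CUDA-core-bound elementwise activation for a second batch runs on another stream.

```python
import torch

# Hypothetical sizes for illustration only.
device = torch.device("cuda")
w  = torch.randn(4096, 4096, device=device, dtype=torch.float16)
x1 = torch.randn(4096, 4096, device=device, dtype=torch.float16)  # batch 1
x2 = torch.randn(4096, 4096, device=device, dtype=torch.float16)  # batch 2

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

torch.cuda.synchronize()
with torch.cuda.stream(s1):
    y1 = x1 @ w                        # GEMM: mostly tensor cores
with torch.cuda.stream(s2):
    y2 = torch.nn.functional.gelu(x2)  # elementwise: mostly CUDA cores
torch.cuda.synchronize()               # join both streams before reading y1/y2
```

Whether the two kernels actually overlap depends on SM occupancy: a GEMM that already saturates the GPU leaves no room for the second stream to run concurrently.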
Oh interesting - maybe CUDA streams are what you're looking for? I.e., you can overlap two CUDA operations on one GPU. Generally it's not that worthwhile, but it could work. The main issue is that with Python-level CUDA streams, the Python overhead might actually make it slower, so some sort of C++-side CUDA stream orchestration would probably be needed.
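One established way to get kernel launches out of the Python hot path is CUDA graph capture - to be clear, just an illustration of the overhead point, not something Unsloth necessarily does. The warm-up-then-capture pattern below follows the `torch.cuda.CUDAGraph` documentation; after capture, each replay submits the whole kernel sequence with a single CPU-side call.

```python
import torch

device = torch.device("cuda")
w = torch.randn(4096, 4096, device=device, dtype=torch.float16)
x = torch.randn(4096, 4096, device=device, dtype=torch.float16)

# Warm up on a side stream (required before capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = torch.nn.functional.gelu(x @ w)
torch.cuda.current_stream().wait_stream(s)

# Capture once; replays bypass per-kernel Python launch overhead.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = torch.nn.functional.gelu(x @ w)

x.copy_(torch.randn_like(x))  # update the static input in place...
g.replay()                    # ...then replay; the result lands in y
```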
@danielhanchen Thank you, that looks interesting! Is there any possibility Unsloth could use this to speed things up?