unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
15.32k stars 1.03k forks

Kernel tuning and benchmarking #131

Open cm2435 opened 7 months ago

cm2435 commented 7 months ago

Hey! Just opening an issue because there doesn't seem to be a discussion board.

I noticed there's no tuning of the Triton kernels for parameters like block size, and not much coverage of whether the kernels are actually faster than torch-native opsets.

Is this a conscious decision? Kernel tuning takes time, on the order of adding perhaps a minute to model fitting, but for a training job that takes more than an hour, a tuned kernel that is even $1/60 \approx 1.7$% faster would pay for itself, which is a fairly low bar.
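The break-even reasoning above can be sketched with a couple of lines of arithmetic (the numbers are the illustrative ones from this comment, not measurements):

```python
# Toy break-even calculation for kernel autotuning: a tuning pass that adds
# time up front pays for itself once the speedup it buys exceeds the
# overhead as a fraction of total training time.

def breakeven_speedup(tuning_overhead_s: float, training_time_s: float) -> float:
    """Minimum fractional speedup needed for tuning to pay for itself."""
    return tuning_overhead_s / training_time_s

# A 1-minute tuning pass ahead of a 1-hour training job:
needed = breakeven_speedup(60.0, 3600.0)
print(f"{needed:.1%}")  # → 1.7%  (any speedup above this is a net win)
```

So the longer the training run, the lower the bar the autotuner has to clear.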

I'd happily cherry-pick my kernel tests from the Phi-2 implementation branch and write some simple benchmarks with them so we can measure the impact of kernel tuning, if there's interest?

danielhanchen commented 7 months ago

@cm2435 Oh fair point on auto tuning on block sizes - I found 1024 approx to be reasonably OK on Tesla T4 and A100s. I think I tuned some myself by hand, so technically I did do some tuning, just not auto-tuning :) There's actually an auto tuner in Triton which allows you to auto select the fastest options. I do agree you can squeeze even more out :)
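Triton's autotuner essentially benchmarks a list of candidate configs and keeps the fastest. A toy pure-Python sketch of that pick-the-fastest idea (the names here are illustrative, not Triton's actual `triton.autotune` API):

```python
import time
from typing import Callable

def autotune(candidates: dict[str, Callable], reps: int = 3) -> Callable:
    """Time each candidate implementation on the given input and return
    the fastest one -- a toy stand-in for what a real autotuner does
    across block-size/warp configs."""
    def best_for(*args):
        timings = {}
        for name, fn in candidates.items():
            start = time.perf_counter()
            for _ in range(reps):
                fn(*args)
            timings[name] = time.perf_counter() - start
        winner = min(timings, key=timings.get)
        return candidates[winner]
    return best_for

# Two toy "kernels" standing in for the same op compiled with different block sizes:
def sum_block_256(xs):  return sum(xs)
def sum_block_1024(xs): return sum(xs)

pick = autotune({"BLOCK=256": sum_block_256, "BLOCK=1024": sum_block_1024})
fastest = pick(list(range(10_000)))
print(fastest([1, 2, 3]))  # → 6
```

The benchmarking loop is exactly where the autotune overhead comes from: every candidate gets executed before the first real call can proceed.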

cm2435 commented 7 months ago

@danielhanchen Yeah, that was what I was going to PR; the thing is that the Triton autotuner has some overhead, because it tries a big combinatorial list of block and warp sizes and then picks the fastest one for your specific matrix shape. So I was wondering if it was worth the tradeoff, or at least worth measuring.

danielhanchen commented 7 months ago

Ye agreed - in fact the overhead is kinda annoying LOL - I remember it was 2-5ms. The issue is one has to benchmark across T4s, A100s and other GPUs. Another, better approach is, before the kernel runs, we "patch" the Triton auto-dispatcher to call only the best config - this can be done, but it'll require some work on the auto-patching side of things.