johnsmith0031 / alpaca_lora_4bit


Triton Backend Trains Slower Than Cuda? #66

johnrobinsn opened 1 year ago

johnrobinsn commented 1 year ago

More curious about what others are seeing....

I'm fine-tuning llama30b on a 24G Titan RTX card, keeping everything else the same (model, dataset (alpaca), hyperparameters, gradient checkpointing on), and just trying out fine-tuning with the two different backends (cuda vs triton).

Cuda is a little more than 3 s/it and Triton is a little under 4 s/it... I guess I was hoping/expecting Triton to perform a bit better (or at least no worse...).
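In case it helps anyone compare numbers, here's a minimal sketch of how s/it can be measured consistently across backends, assuming a standard PyTorch training loop; `model`, `optimizer`, and `batches` are placeholders, not names from this repo:

```python
import time
import torch

def mean_seconds_per_iteration(model, optimizer, batches, warmup=3):
    """Average wall-clock seconds per training step, skipping the first
    few iterations (triton autotunes its kernels on early calls, which
    would otherwise skew a cuda-vs-triton comparison)."""
    times = []
    for i, batch in enumerate(batches):
        torch.cuda.synchronize()    # flush any queued kernels
        start = time.perf_counter()
        loss = model(**batch).loss  # assumes an HF-style model that returns .loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.synchronize()    # wait for the step to finish
        if i >= warmup:
            times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```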

What are others seeing?

Thanks!

Ph0rk0z commented 1 year ago

New cuda is slow AF too. I loaded the same exact GPTQv1 llama-30b 4bit model and performed inference.

Card: Quadro P6000

Old Cuda implementation:

Output generated in 6.92 seconds (4.19 tokens/s, 29 tokens, context 382)
Output generated in 8.72 seconds (5.05 tokens/s, 44 tokens, context 382)

New Cuda implementation:

Output generated in 26.57 seconds (1.09 tokens/s, 29 tokens, context 376)
Output generated in 26.02 seconds (1.11 tokens/s, 29 tokens, context 376)

Triton results for me, using GPTQv2 models that load and work when the backend is set to 'cuda':

python: /project/lib/Analysis/Utility.cpp:136: bool mlir::supportMMA(mlir::Value, int): Assertion `(version == 1 || version == 2) && "Unexpected MMA layout version found"' failed.
Aborted
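
For reference, backend switching on my end looks like the snippet below. I'm assuming the `switch_backend_to` helper in `autograd_4bit` here, so treat the exact import as an assumption and check your checkout:

```python
# Assumed backend-selection helper from this repo's autograd_4bit module;
# the function name is an assumption, verify against your checkout.
from autograd_4bit import switch_backend_to

switch_backend_to('cuda')      # these GPTQv2 models load and run this way
# switch_backend_to('triton')  # aborts with the MMA assertion above
```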

I'm assuming training will follow the same story: the new version will be slow.

johnrobinsn commented 1 year ago

Looks like there are some promising triton perf improvements upstream in sterlind's gptq-for-llama repo. Any thoughts on getting sterlind's bits merged into the qwopqwop repo?