johnrobinsn opened 1 year ago
New cuda is slow AF too. I loaded the exact same GPTQv1 llama-30b 4bit model and performed inference.
Card: Quadro P6000
Old Cuda implementation:
Output generated in 6.92 seconds (4.19 tokens/s, 29 tokens, context 382)
Output generated in 8.72 seconds (5.05 tokens/s, 44 tokens, context 382)
New Cuda implementation:
Output generated in 26.57 seconds (1.09 tokens/s, 29 tokens, context 376)
Output generated in 26.02 seconds (1.11 tokens/s, 29 tokens, context 376)
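For context, the tokens/s figures above are just new tokens divided by wall-clock generation time. Here's a minimal sketch of how I'd reproduce that measurement; `load_quant_model` is a hypothetical placeholder for however you load the 4-bit checkpoint with a given backend, not the actual webui code path:

```python
import time
import torch
from transformers import AutoTokenizer

# Hypothetical loader for the GPTQ llama-30b 4-bit checkpoint with the chosen
# backend (old-cuda / new-cuda / triton) -- substitute your own loading code.
model = load_quant_model("llama-30b-4bit.safetensors")
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-30b")  # placeholder repo id

prompt = "Write a short story about a robot."  # my runs used a ~380-token context
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

torch.cuda.synchronize()
start = time.time()
# Assumes the loaded model exposes an HF-style generate()
output = model.generate(**inputs, max_new_tokens=29, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"Output generated in {elapsed:.2f} seconds "
      f"({new_tokens / elapsed:.2f} tokens/s, {new_tokens} tokens)")
```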
Triton results for me, using GPTQv2 models that load and work when the backend is set to 'cuda':
python: /project/lib/Analysis/Utility.cpp:136: bool mlir::supportMMA(mlir::Value, int): Assertion `(version == 1 || version == 2) && "Unexpected MMA layout version found"' failed.
Aborted
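For what it's worth, that assertion comes from Triton's tensor-core (MMA) path, which as far as I know assumes a Volta-or-newer GPU, while the Quadro P6000 is Pascal (compute capability 6.1). I'm not certain that's the root cause, but a quick sanity check before picking the triton backend might look like this:

```python
import torch

# Report the GPU's compute capability (major, minor).
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")

# Triton's MMA/tensor-core kernels generally expect sm_70 (Volta) or newer;
# on an older card like the Pascal P6000 (sm_61) the cuda backend may be the
# only practical option.
if major < 7:
    print("Pre-Volta GPU detected -- consider the cuda backend instead of triton")
```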
I'm assuming training will follow the same story: the new version will be slow.
Looks like there are some promising Triton perf improvements upstream in the sterlind gptq-for-llama repo. Any thoughts on getting the sterlind bits merged into the qwopqwop repo?
More curious about what others are seeing...
I'm fine-tuning llama30b on a 24G Titan RTX card, keeping everything else the same (model, dataset (alpaca), hyperparameters, gradient checkpointing on) and just trying out fine-tuning with the two different backends (cuda vs triton).
Cuda is a little more than 3 s/it and Triton is a little under 4 s/it... I guess I was hoping/expecting Triton to perform a bit better (or at least no worse). A rough sketch of how I'm thinking about the s/it number is below.
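The s/it figures are just the per-step wall-clock time the trainer reports; if you want to measure it by hand, a generic PyTorch timing loop like this (not the exact fine-tuning script, and it assumes batches already live on the GPU and include labels) gives roughly the same number:

```python
import time
import torch

def time_iterations(model, dataloader, optimizer, n_iters=20):
    """Average wall-clock seconds per optimizer step (rough s/it measure)."""
    model.train()
    times = []
    it = iter(dataloader)
    for _ in range(n_iters):
        batch = next(it)  # assumed to already be on the GPU and contain labels
        torch.cuda.synchronize()
        start = time.time()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.synchronize()
        times.append(time.time() - start)
    return sum(times) / len(times)

# In my runs this lands a little over 3 s/it with the cuda backend and a
# little under 4 s/it with triton on the Titan RTX.
```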
What are others seeing?
Thanks!