turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

FP16 + ROCm Possibly Subpar Performance #428

Closed: Beinsezii closed this issue 2 weeks ago

Beinsezii commented 2 months ago

When using ExLlamaV2 as a loader for FP16 models, my 7900 XTX (~32 t/s) falls well short of a 3090 (~40 t/s), yet with EXL2-formatted models I'm usually equal or faster.

I wouldn't put it past ROCm to just be jank, but if there are tunable parameters worth looking into, I'd be willing to compile some builds locally to measure the impact.
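For reference, here's roughly how the numbers could be reproduced; this is a minimal sketch based on the repo's example scripts, with the model path and prompt as placeholders and the `ExLlamaV2BaseGenerator`/`generate_simple` usage assumed from those examples:

```python
# Rough tokens/sec measurement using exllamav2's simple generator.
# Sketch only: model_dir and prompt are placeholders, and the API usage is
# assumed from the repo's example scripts. Point model_dir at an FP16 or
# EXL2 model to compare the two paths.
import time
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"  # placeholder
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
generator.warmup()

max_new_tokens = 256
t0 = time.time()
output = generator.generate_simple("Once upon a time,", settings, max_new_tokens)
dt = time.time() - t0
print(f"{max_new_tokens / dt:.1f} t/s")
```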

turboderp commented 2 months ago

There are lots of places where the kernels could be tuned, and it would make sense to look at where the cuBLAS functions are translated into hipBLAS to see if everything's working optimally there.
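As a first sanity check, one could time a plain FP16 matmul through torch, which dispatches to hipBLAS on a ROCm build. This is only a sketch: the shapes are arbitrary and don't correspond to exllamav2's actual GEMM calls, it just gives a rough idea of whether raw half-precision GEMM throughput on the card is where it should be.

```python
# Sketch: time an FP16 GEMM through torch (hipBLAS on a ROCm build) to
# sanity-check raw matmul throughput. Shapes are illustrative only.
import torch

def bench_hgemm(m=4096, n=4096, k=4096, iters=100):
    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)

    # Warm-up so kernel selection doesn't skew the timing
    for _ in range(10):
        torch.matmul(a, b)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()

    ms = start.elapsed_time(end) / iters
    tflops = 2 * m * n * k / (ms * 1e-3) / 1e12  # 2*m*n*k FLOPs per GEMM
    print(f"{m}x{n}x{k} fp16 GEMM: {ms:.3f} ms/iter, ~{tflops:.1f} TFLOPS")

if __name__ == "__main__":
    bench_hgemm()
```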

I do have a PC set up with a 7900XTX for this purpose, but I've been put off a little bit by the lack of good profiling tools for ROCm. Any suggestions?

Beinsezii commented 2 months ago

Not much off the top of my head.

The newer hipBLASLt library seems to actually work on my card now as of ROCm 6.0 + the torch 2.3 release candidate. I don't think it has fallbacks for RDNA 2 and older cards though, so it would need a separate code path, assuming it's even accessible through torch's hipify layer. I did at least get bitsandbytes working without infs.
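The "no infs" check was along these lines; a hypothetical sketch where the layer sizes are arbitrary and the `Linear8bitLt` usage follows the upstream bitsandbytes examples, not anything specific to exllamav2:

```python
# Smoke test: run one 8-bit bitsandbytes layer on the GPU and check the output
# for infs/nans. Layer sizes are arbitrary; usage assumed from bitsandbytes docs.
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear8bitLt(4096, 4096, has_fp16_weights=False)
layer = layer.cuda()  # quantizes the weights when moved to the device

x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)
with torch.no_grad():
    y = layer(x)

print("inf:", torch.isinf(y).any().item(), "nan:", torch.isnan(y).any().item())
```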

I was curious, and it seems their shiny new profiler is only for the MI cards, so for Navi it's just rocprofv2...

If anything, maybe you could file an upstream issue report or two, since there's pretty good motivation? Their communication usually isn't great, but at one point we made enough noise that they shipped a barely functional flash attention kernel for SD 1.5/XL.

turboderp commented 2 weeks ago

I'll close this for now, but I've added hipBLASLt to my ever-growing list of things to look at eventually. Generally, though, I can't feel too optimistic about ROCm as long as the flash-attn ROCm fork remains neglected. AMD's engineers should be all over that if they were serious about open-source ML.