turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Significant performance degradation between 0.11 and 0.12 when used from ooba #302

Closed: aikitoria closed this issue 3 months ago

aikitoria commented 8 months ago

I originally posted this in the ooba repo, but he suggested also posting it here for visibility. https://github.com/oobabooga/text-generation-webui/issues/5383

The commit updating exllamav2 to 0.12 in ooba reduces my tokens/s for Goliath 120b by over 30% on A100 or H100 GPUs. Exact same generation settings and same model otherwise. What could be the problem?

0.11 through ooba: 14-15 tokens/s on A100
0.12 through ooba: 10-11 tokens/s on A100

Beinsezii commented 8 months ago

It's even worse on ROCm w/ 7900 XTX. Possibly because the bottleneck is greater at higher speeds? Or maybe because I don't have Flash Attention.

python test_inference.py -m "Beinsezii_ReMM-v2.2-L2-13B-EXL2_h8_b8" -l 4096 -s

Exllama 0.0.11:

 ** Position     1 + 127 tokens:   54.1357 t/s
 ** Position   128 + 128 tokens:   54.0354 t/s
 ** Position   256 + 128 tokens:   53.5179 t/s
 ** Position   384 + 128 tokens:   53.0453 t/s
 ** Position   512 + 128 tokens:   52.3863 t/s
 ** Position   640 + 128 tokens:   51.9039 t/s
 ** Position   768 + 128 tokens:   51.3843 t/s
 ** Position   896 + 128 tokens:   50.9276 t/s
.....
 ** Position  3968 + 128 tokens:   38.3066 t/s

Exllama 0.0.12:

 ** Position     1 + 127 tokens:   24.8812 t/s
 ** Position   128 + 128 tokens:   25.0801 t/s
 ** Position   256 + 128 tokens:   24.9860 t/s
 ** Position   384 + 128 tokens:   24.8973 t/s
 ** Position   512 + 128 tokens:   24.6968 t/s
 ** Position   640 + 128 tokens:   24.6020 t/s
 ** Position   768 + 128 tokens:   24.4626 t/s
 ** Position   896 + 128 tokens:   24.3808 t/s
.....
 ** Position  3968 + 128 tokens:   20.9329 t/s

Oobabooga webui shows similar numbers, with it taking roughly twice the time for a response of the same token length.

Wonder if it's because of

Return probs from streaming generator

But there's no way to expose that in test_inference or ooba.

I tried both the ROCm 5.6 precompiled kernel wheel and JIT compilation against my system's ROCm 6.0.

Ph0rk0z commented 8 months ago

Log probs were made optional in another commit.

aikitoria commented 8 months ago

Tracked down the problematic commit in the other issue: https://github.com/oobabooga/text-generation-webui/issues/5383#issuecomment-1913878870

turboderp commented 8 months ago

I wasn't able to measure any performance degradation with that change on any of my GPUs, but I would presume it's because of the minimum batch size of 4 using more SMEM than needed, so I've reverted it somewhat in the latest commit.
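
As a rough illustration of the occupancy argument (made-up round numbers, not the kernel's actual tile sizes): if every block reserves shared memory for 4 rows even when only 1 row is being generated, fewer blocks fit on each SM.

# Illustrative only: SMEM_PER_SM and BYTES_PER_ROW_TILE are hypothetical values,
# not exllamav2's real figures. The point is just that per-block shared memory
# scales with the number of rows the block is sized for, and occupancy drops with it.
SMEM_PER_SM = 64 * 1024        # assumed usable shared memory per SM, in bytes
BYTES_PER_ROW_TILE = 8 * 1024  # hypothetical per-row tile footprint, in bytes

for rows_per_block in (1, 4):
    smem_per_block = rows_per_block * BYTES_PER_ROW_TILE
    print(rows_per_block, "row(s)/block ->", SMEM_PER_SM // smem_per_block, "blocks/SM by the SMEM limit")
# prints: 1 row(s)/block -> 8 blocks/SM, 4 row(s)/block -> 2 blocks/SM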

I'll soon have more options for testing here, but please try it out in the meantime.

Beinsezii commented 8 months ago

I would presume it's because of the minimum batch size of 4 using more SMEM than needed, so I've reverted it somewhat in the latest commit.

That seems to have gotten back 95% of the lost speed on my XTX, which I think is acceptable if batched performance is still improved enough despite the partial revert.

Using test_inference.py on the 8-bit Llama 13B again, ROCm 6.0, with git clean -fd to clear HIP build files after every checkout...

New min batch fix @ 30fe6e7a7cfc33d554de31c55e3b1f046e13740f

 ** Position     1 + 127 tokens:   50.9972 t/s
 ** Position   128 + 128 tokens:   51.1531 t/s
.....
 ** Position  3840 + 128 tokens:   37.5028 t/s
 ** Position  3968 + 128 tokens:   37.0016 t/s

Re-test 0.0.12 @ f94efb3a0fcee687162870b71fdc575247401678

 ** Position     1 + 127 tokens:   24.8753 t/s
 ** Position   128 + 128 tokens:   24.9537 t/s
.....
 ** Position  3840 + 128 tokens:   20.9512 t/s
 ** Position  3968 + 128 tokens:   20.8266 t/s

Re-test 0.0.11 @ a4ecea6d57c1bb231dfd06acf5b0454e5bff0bd8

 ** Position     1 + 127 tokens:   53.7663 t/s
 ** Position   128 + 128 tokens:   53.5430 t/s
.....
 ** Position  3840 + 128 tokens:   38.5868 t/s
 ** Position  3968 + 128 tokens:   38.0602 t/s

I wasn't able to measure any performance degradation with that change on any of my GPUs

Have you benchmarked without Flash Attention? A friend of mine with a 3090 on Windows, without Flash Attention, supposedly got a similar 50% hit going from 0.0.11 to 0.0.12. I know ideally everyone should be running FA2, but right now the ROCm Navi fork is too far behind upstream, and CUDA/Windows seems prone to miscompiles that your average end user can't resolve.

turboderp commented 8 months ago

FA2 isn't noticeably faster than the fallback matmul attention on short sequences, so you shouldn't see a performance drop until you get to longer contexts.

I checked again on my 3090Ti and didn't see a difference between 0.0.11 and 0.0.12, but there was a slight degradation on the regular 3090. Only a few % though and nothing near 50%. So there's likely something going on with Windows as well.

There are other changes between 0.0.11 and 0.0.12 too, one of them being that flash-attn is disabled when doing non-causal attention (e.g. with padding masks), since that wasn't working correctly. That's a temporary fix, though, only until flash-attn eventually supports attention masking/bias.
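
As a sketch of the dispatch being described (illustrative, not the library's actual code; only flash_attn_func's published signature is assumed, everything else is hypothetical): flash-attn is used when it's installed and the attention is plain causal, and anything needing a padding mask falls back to matmul attention.

import torch

try:
    from flash_attn import flash_attn_func  # optional dependency
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

def attend(q, k, v, causal=True, attn_bias=None):
    # q, k, v: (batch, seq_len, num_heads, head_dim), fp16/bf16 on GPU
    if HAS_FLASH_ATTN and causal and attn_bias is None:
        # flash-attn path: plain causal attention only, since flash-attn
        # doesn't currently accept an arbitrary mask/bias
        return flash_attn_func(q, k, v, causal=True)
    # fallback matmul attention
    q, k, v = (x.transpose(1, 2) for x in (q, k, v))  # -> (batch, heads, seq, dim)
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    if causal:
        n = scores.shape[-1]  # assumes q and k cover the same positions
        mask = torch.ones(n, n, dtype=torch.bool, device=q.device).triu(1)
        scores = scores.masked_fill(mask, float("-inf"))
    if attn_bias is not None:
        scores = scores + attn_bias  # e.g. a padding mask as an additive bias
    return (torch.softmax(scores, dim=-1) @ v).transpose(1, 2)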

I've also been looking at other solutions for attention, including xformers, which also seems to have a ROCm port.

Beinsezii commented 8 months ago

The xformers ROCm port doesn't work on Navi GPUs yet, lol. IIRC the ROCm Composable Kernel has an "all-arches" branch with the modules needed for FA and the like, so in theory it's coming soon™. In practice, though, it took something like 6 months after release before PyTorch could detect my card, and until yesterday for it to stop randomly entering page-fault hell, so I won't hold my breath.

The only memory-efficient attention that's worked 100% of the time so far is "sub-quadratic attention", which AFAIK is a pure PyTorch re-implementation of some of the Flash Attention techniques.
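
The core trick is roughly the sketch below (a minimal illustration of chunked attention with an online softmax, not the actual sub-quadratic attention code; shapes and chunk size are arbitrary): the key/value sequence is processed in chunks and a running max/denominator is kept, so the full score matrix is never materialized.

import torch

def chunked_attention(q, k, v, chunk=256):
    # q: (heads, q_len, dim); k, v: (heads, kv_len, dim)
    # Peak memory scales with q_len * chunk instead of q_len * kv_len.
    scale = q.shape[-1] ** -0.5
    acc = torch.zeros_like(q, dtype=torch.float32)   # running weighted sum of values
    denom = torch.zeros(*q.shape[:-1], 1, dtype=torch.float32, device=q.device)
    row_max = torch.full_like(denom, float("-inf"))  # running row maximum
    for start in range(0, k.shape[-2], chunk):
        k_c = k[..., start:start + chunk, :].float()
        v_c = v[..., start:start + chunk, :].float()
        scores = (q.float() @ k_c.transpose(-1, -2)) * scale
        new_max = torch.maximum(row_max, scores.amax(-1, keepdim=True))
        correction = torch.exp(row_max - new_max)    # rescale previous accumulators
        probs = torch.exp(scores - new_max)
        acc = acc * correction + probs @ v_c
        denom = denom * correction + probs.sum(-1, keepdim=True)
        row_max = new_max
    return (acc / denom).to(q.dtype)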

Ph0rk0z commented 8 months ago

xformers works with all my old GPUs and casts their calculations to FP32. I get fast SD outputs using it.

aikitoria commented 8 months ago

It looks like that change improved it quite a bit, but it's still not entirely as fast as 0.0.11 was.

tags/v0.0.11:

Output generated in 8.78 seconds (14.47 tokens/s, 127 tokens, context 1682, seed 2142719412)
Output generated in 8.84 seconds (14.37 tokens/s, 127 tokens, context 1682, seed 131042970)

master yesterday:

Output generated in 11.11 seconds (11.43 tokens/s, 127 tokens, context 1682, seed 295294536)
Output generated in 11.12 seconds (11.42 tokens/s, 127 tokens, context 1682, seed 245566471)

master today:

Output generated in 9.42 seconds (13.48 tokens/s, 127 tokens, context 1162, seed 588466037)
Output generated in 9.42 seconds (13.48 tokens/s, 127 tokens, context 1162, seed 1275382615)

aikitoria commented 8 months ago

Experimenting a bit with the defines that commit also changed. Not sure what a Q_GEMM row is, but it seems to affect the performance:

Current

#define MAX_Q_GEMM_ROWS 32
#define MAX_Q_GEMM_ROWS_KERNEL 4
Output generated in 9.42 seconds (13.48 tokens/s, 127 tokens, context 1162, seed 588466037)
Output generated in 9.42 seconds (13.48 tokens/s, 127 tokens, context 1162, seed 1275382615)

Test 1 (fail)

#define MAX_Q_GEMM_ROWS 50
#define MAX_Q_GEMM_ROWS_KERNEL 4
Output generated in 9.40 seconds (13.51 tokens/s, 127 tokens, context 1162, seed 2128836875)
Output generated in 9.36 seconds (13.57 tokens/s, 127 tokens, context 1162, seed 1816688436)

Test 2 (fail)

#define MAX_Q_GEMM_ROWS 32
#define MAX_Q_GEMM_ROWS_KERNEL 6
Output generated in 9.49 seconds (13.38 tokens/s, 127 tokens, context 1162, seed 1968295458)
Output generated in 9.36 seconds (13.56 tokens/s, 127 tokens, context 1162, seed 590793050)

Test 3 (now we fully got the performance back!)

#define MAX_Q_GEMM_ROWS 50
#define MAX_Q_GEMM_ROWS_KERNEL 6
Output generated in 8.65 seconds (14.68 tokens/s, 127 tokens, context 1162, seed 1891778251)
Output generated in 8.72 seconds (14.56 tokens/s, 127 tokens, context 1162, seed 1841075734)

Test 4 (no further improvement)

#define MAX_Q_GEMM_ROWS 64
#define MAX_Q_GEMM_ROWS_KERNEL 6
Output generated in 8.74 seconds (14.53 tokens/s, 127 tokens, context 1162, seed 350804564)
Output generated in 8.75 seconds (14.52 tokens/s, 127 tokens, context 1162, seed 1198205575)

Test 5 (no further improvement)

#define MAX_Q_GEMM_ROWS 50
#define MAX_Q_GEMM_ROWS_KERNEL 8
Output generated in 8.70 seconds (14.59 tokens/s, 127 tokens, context 1162, seed 1355664497)
Output generated in 8.77 seconds (14.48 tokens/s, 127 tokens, context 1162, seed 1384194502)

Test 6 (no further improvement)

#define MAX_Q_GEMM_ROWS 64
#define MAX_Q_GEMM_ROWS_KERNEL 8
Output generated in 8.72 seconds (14.57 tokens/s, 127 tokens, context 1162, seed 832394660)
Output generated in 8.67 seconds (14.64 tokens/s, 127 tokens, context 1162, seed 399488503)

Ph0rk0z commented 8 months ago

Heh, this is like the llama.cpp hack.

turboderp commented 8 months ago

@aikitoria There's no sane reason why you'd get more speed with MAX_Q_GEMM_ROWS = 50. All that does is change the threshold at which the matmuls are performed with reconstructed FP16 weights instead of quantized weights, but you'll be nowhere near that threshold in your test. You're doing either one token at a time (1 row) or prompt ingestion (1162 rows).

Is it possible you have something else running on the GPU? The smallest thing can make a difference in that regard. Best way to test it would be to run python test_inference.py -m <model> -s in a terminal with no other apps open. Also depending on your DE you might want to minimize the terminal.

MAX_Q_GEMM_ROWS_KERNEL is the max number of rows the kernel will treat in each block, but it doesn't matter in either of the above cases. It would matter if you were doing inference on between 5 and 49 rows, but the effect would most likely be a segfault, since you'd also have to actually compile the kernel instances for more than 4 rows (see the compilation units in exllamav2/exllamav2_ext/cuda/comp_units).
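
For anyone following along, the threshold works roughly like the sketch below (illustrative only, not the actual extension code; quant_gemm and reconstruct_fp16 are hypothetical stand-ins, and the exact boundary condition is a guess): a handful of rows goes through the quantized-weight kernel, while large row counts such as prompt ingestion dequantize the weights to FP16 once and use a regular matmul.

MAX_Q_GEMM_ROWS = 32  # the value in the current source; the experiments above varied it

def q_linear(x, q_weight, quant_gemm, reconstruct_fp16):
    # Illustrative dispatch. Assumed helpers:
    #   quant_gemm(x, q_weight)    -> matmul directly against the quantized weights
    #                                 (efficient for a few rows, i.e. token-by-token generation)
    #   reconstruct_fp16(q_weight) -> dequantize the full weight matrix to FP16
    #                                 (worth the one-off cost when many rows are processed at once)
    rows = x.shape[-2]
    if rows <= MAX_Q_GEMM_ROWS:
        return quant_gemm(x, q_weight)
    return x @ reconstruct_fp16(q_weight)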

aikitoria commented 8 months ago

Very strange. It's a server from RunPod (specifically, the one with 1x A100 SXM 80GB so it can run Goliath sensibly). There's nothing running on the GPU, not even a desktop environment (nvidia-smi shows 4 MB usage at idle). Let me confirm with test_inference.py.

aikitoria commented 8 months ago

With the current defines:

 -- Measuring token speed...
 ** Position     1 + 127 tokens:   16.0222 t/s
 ** Position   128 + 128 tokens:   16.0825 t/s
 ** Position   256 + 128 tokens:   16.0701 t/s
 ** Position   384 + 128 tokens:   16.0528 t/s
 ** Position   512 + 128 tokens:   16.0472 t/s
 ** Position   640 + 128 tokens:   16.0284 t/s
 ** Position   768 + 128 tokens:   16.0101 t/s
 ** Position   896 + 128 tokens:   15.9776 t/s
 ** Position  1024 + 128 tokens:   15.9675 t/s
 ** Position  1152 + 128 tokens:   15.9456 t/s
 ** Position  1280 + 128 tokens:   15.9176 t/s
 ** Position  1408 + 128 tokens:   15.9095 t/s
 ** Position  1536 + 128 tokens:   15.8921 t/s
 ** Position  1664 + 128 tokens:   15.9030 t/s
 ** Position  1792 + 128 tokens:   15.8858 t/s
 ** Position  1920 + 128 tokens:   15.8850 t/s
 ** Position  2048 + 128 tokens:   15.8728 t/s

With the changed defines:

 -- Measuring token speed...
 ** Position     1 + 127 tokens:   16.0575 t/s
 ** Position   128 + 128 tokens:   16.1216 t/s
 ** Position   256 + 128 tokens:   16.1133 t/s
 ** Position   384 + 128 tokens:   16.0955 t/s
 ** Position   512 + 128 tokens:   16.0785 t/s
 ** Position   640 + 128 tokens:   16.0545 t/s
 ** Position   768 + 128 tokens:   16.0373 t/s
 ** Position   896 + 128 tokens:   16.0261 t/s
 ** Position  1024 + 128 tokens:   16.0049 t/s
 ** Position  1152 + 128 tokens:   15.9899 t/s
 ** Position  1280 + 128 tokens:   15.9709 t/s
 ** Position  1408 + 128 tokens:   15.9531 t/s
 ** Position  1536 + 128 tokens:   15.9343 t/s
 ** Position  1664 + 128 tokens:   15.9491 t/s
 ** Position  1792 + 128 tokens:   15.9297 t/s
 ** Position  1920 + 128 tokens:   15.9249 t/s
 ** Position  2048 + 128 tokens:   15.9114 t/s

There is, at best, an absolutely minuscule difference here (probably just noise). But then... what was going on with ooba? Going to re-test this on a different GPU server.

aikitoria commented 8 months ago

I can't reproduce it on the other server. Maybe there was something else affecting the host.

Beinsezii commented 7 months ago

Should this be closed then if it's no longer reproducible?

aikitoria commented 7 months ago

Not sure, @turboderp mentioned wanting to test more things, but the original issue is fixed with that commit.