turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Dynamic gen is slower?! #469

Closed Ph0rk0z closed 3 months ago

Ph0rk0z commented 4 months ago

Trying out dynamic gen in TabbyAPI vs textgen on command-r+

3x3090 with FA.

14:23:53-347821 INFO     Loaded "command-r-plus-103B-exl2-4.5bpw" in 19.20 seconds.
14:23:53-349262 INFO     LOADER: "ExLlamav2_HF"
14:23:53-350394 INFO     TRUNCATION LENGTH: 16384
14:23:53-351283 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
Output generated in 39.87 seconds (10.38 tokens/s, 414 tokens, context 1494, seed 219301613)
Output generated in 27.27 seconds (11.33 tokens/s, 309 tokens, context 1494, seed 376477788)

vs

INFO:     192.168.1.199:58228 - "POST /v1/completions HTTP/1.1" 200
INFO:     Metrics: 167 tokens generated in 21.39 seconds (Queue: 0.0 s, Process: 626.24 T/s, Generate: 8.79 T/s, Context: 1494 tokens) 
INFO:     192.168.1.199:54584 - "POST /v1/completions HTTP/1.1" 200
INFO:     Metrics: 299 tokens generated in 32.78 seconds (Queue: 0.0 s, Process: 308821.16 T/s, Generate: 9.12 T/s, Context: 1494 tokens) 

I'm only setting the batch size to 1; is this normal?

turboderp commented 4 months ago

In all my measurements the dynamic generator has been slightly faster than the old streaming generator at bsz 1 (and of course much faster with bsz 2+). There might be extra overhead because of the new control flow in Tabby, though.

Also, it's hard to say whether TGW and Tabby measure tokens/s the same way. A better comparison would be against the previous version of Tabby.
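To illustrate how two frontends can report different tokens/s for the same generation, here's a minimal sketch: whether prompt-processing time is included in the denominator changes the reported figure. The function names and timings are illustrative, not taken from either project.

```python
# Two common ways to report generation throughput. Frontends that include
# prompt-processing time in the denominator will report lower tokens/s than
# those that time generation alone.

def throughput_including_prompt(new_tokens: int, prompt_time: float, gen_time: float) -> float:
    """New tokens over the whole request (prompt processing + generation)."""
    return new_tokens / (prompt_time + gen_time)

def throughput_generation_only(new_tokens: int, gen_time: float) -> float:
    """New tokens over generation time alone."""
    return new_tokens / gen_time

# Example: 300 tokens generated, 2.0 s prompt processing, 30.0 s generation.
print(throughput_including_prompt(300, 2.0, 30.0))  # 9.375
print(throughput_generation_only(300, 30.0))        # 10.0
```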

Ph0rk0z commented 4 months ago

I think if I tried the previous version of Tabby, it would use the standard generator. In that case it was slightly faster than textgen, in the same way exllama was faster than exllama_HF. Currently it's 11.30 vs 11.45 tokens/s; the gap used to be larger. I can also try the old version from before dynamic gen, since I backed it up when updating xformers support, and I can check CUDA 12 vs 11.8.

turboderp commented 4 months ago

Are you using Q4 cache for this?

Ph0rk0z commented 4 months ago

Yes, Q4 cache always. It's closer now with the latest commits; I see at least 11 tokens/s.
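For context on why Q4 cache matters here: it roughly quarters KV-cache memory relative to FP16, at some compute cost. A back-of-envelope sketch follows; the model dimensions (layers, KV heads, head size) are assumed values for illustration, not verified Command R+ figures, and the Q4 estimate ignores quantization-scale overhead.

```python
# Rough KV-cache size comparison for FP16 vs Q4 cache.
# All model dimensions below are assumptions for illustration only.

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: float) -> float:
    # K and V each store layers * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed dimensions: 64 layers, 8 KV heads, head_dim 128, 16384-token context.
fp16 = kv_cache_bytes(64, 8, 128, 16384, 2.0)  # 2 bytes per element
q4   = kv_cache_bytes(64, 8, 128, 16384, 0.5)  # ~0.5 bytes per element (no scale overhead)

print(f"FP16 cache: {fp16 / 2**30:.1f} GiB, Q4 cache: {q4 / 2**30:.1f} GiB")
```

On a 3x3090 setup running a 103B model, that kind of saving is often the difference between fitting the desired context length or not.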