Using TabbyAPI/exllamav2 with Llama3.1-8B
Threadripper Pro/A6000 GPU
Inference runs at ~70 t/s unconstrained for a single request, and ~35 t/s with lm-format-enforcer (JSON schema).
With 30 simultaneous requests, performance drops to ~1-2 t/s per request and CUDA utilisation falls to ~10%. This does not occur without lm-format-enforcer (90-100% CUDA utilisation, 10-20 t/s per request).
@turboderp has been able to replicate and suggests it is due to the large Llama3.1 vocabulary combined with the GIL forcing single-threaded behaviour.
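To make the suspected bottleneck concrete, here is a minimal, self-contained sketch (not lm-format-enforcer's actual code) of why this shape of workload collapses under threading: per-token schema filtering is pure-Python, CPU-bound work over the full Llama 3.1 vocabulary (~128k tokens), and while it runs it holds the GIL, so 30 concurrent request threads execute it one at a time. The function and sizes below are illustrative assumptions only.

```python
import threading
import time

VOCAB_SIZE = 128_256  # approximate Llama 3.1 vocabulary size

def compute_allowed_mask(allowed_ids):
    # Illustrative stand-in for per-token constraint filtering:
    # pure-Python, CPU-bound work over the whole vocabulary.
    # While this loop runs it holds the GIL, so concurrent
    # request threads cannot make progress in parallel.
    mask = [False] * VOCAB_SIZE
    for tid in allowed_ids:
        mask[tid] = True
    return mask

def run_requests(n_requests, allowed_ids):
    # Simulate n simultaneous requests each doing one filtering step.
    threads = [
        threading.Thread(target=compute_allowed_mask, args=(allowed_ids,))
        for _ in range(n_requests)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

allowed = list(range(0, VOCAB_SIZE, 100))
t1 = run_requests(1, allowed)
t30 = run_requests(30, allowed)
# With the GIL, the 30-thread run takes roughly 30x the single-thread
# time rather than finishing concurrently.
print(f"1 request: {t1:.4f}s, 30 requests: {t30:.4f}s")
```

The GPU meanwhile sits idle waiting for the mask, which would match the observed ~10% CUDA utilisation.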
Is this likely to be fixable or is it too complex? Thanks!
I think the correct way to approach this would probably be to use some multiprocessing / queue setup, but it would have to be deeply integrated with exllamav2.
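As a rough illustration of that direction (this is a hypothetical sketch, not an exllamav2 or TabbyAPI API): fan the per-request filtering work out to a pool of worker processes, each with its own interpreter and its own GIL, so the main process stays free to drive CUDA. The `allowed_mask` / `filter_batch` names and the vocabulary size are assumptions for the example; a real fix would need the mask results fed back into the sampling loop, which is where the deep integration comes in.

```python
from concurrent.futures import ProcessPoolExecutor

VOCAB_SIZE = 128_256  # approximate Llama 3.1 vocabulary size

def allowed_mask(allowed_ids):
    # The same CPU-bound mask construction as before, but now run in a
    # worker process, so it no longer contends for the main GIL.
    mask = [False] * VOCAB_SIZE
    for tid in allowed_ids:
        mask[tid] = True
    return mask

def filter_batch(batch_allowed_ids, max_workers=8):
    # Hypothetical batch step: one filtering task per in-flight request,
    # distributed across processes instead of threads.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(allowed_mask, batch_allowed_ids))

if __name__ == "__main__":
    # 30 simulated concurrent requests, each with a different allowed set.
    batch = [list(range(i, VOCAB_SIZE, 97)) for i in range(30)]
    masks = filter_batch(batch)
    print(len(masks))  # one mask per request
```

The trade-off is serialisation overhead: shipping a 128k-entry mask per token per request between processes is itself costly, so a production version would likely use shared memory or compact index lists rather than pickled Python lists.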