Using TabbyAPI/exllamav2 with Llama3.1-8B
Threadripper Pro/A6000 GPU
Inference runs at ~70 t/s unconstrained for a single request, and ~35 t/s with lm-format-enforcer (JSON schema).
With 30 simultaneous requests, performance drops to ~1-2 t/s per request and CUDA utilisation falls to ~10%. This does not occur without lm-format-enforcer (90-100% CUDA utilisation, 10-20 t/s per request).
@turboderp has been able to replicate and suggests it is due to the large Llama3.1 vocabulary combined with the GIL forcing single-threaded behaviour.
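To make the suspected bottleneck concrete, here is a minimal, self-contained sketch (not lm-format-enforcer's actual code) of why this shape of workload collapses under threading: per-token schema filtering is pure-Python, CPU-bound work over the full Llama 3.1 vocabulary (~128k tokens), and while it runs it holds the GIL, so 30 concurrent request threads execute it one at a time. The function and sizes below are illustrative assumptions only.

```python
import threading
import time

VOCAB_SIZE = 128_256  # approximate Llama 3.1 vocabulary size

def compute_allowed_mask(allowed_ids):
    # Illustrative stand-in for per-token constraint filtering:
    # pure-Python, CPU-bound work over the whole vocabulary.
    # While this loop runs it holds the GIL, so concurrent
    # request threads cannot make progress in parallel.
    mask = [False] * VOCAB_SIZE
    for tid in allowed_ids:
        mask[tid] = True
    return mask

def run_requests(n_requests, allowed_ids):
    # Simulate n simultaneous requests each doing one filtering step.
    threads = [
        threading.Thread(target=compute_allowed_mask, args=(allowed_ids,))
        for _ in range(n_requests)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

allowed = list(range(0, VOCAB_SIZE, 100))
t1 = run_requests(1, allowed)
t30 = run_requests(30, allowed)
# With the GIL, the 30-thread run takes roughly 30x the single-thread
# time rather than finishing concurrently.
print(f"1 request: {t1:.4f}s, 30 requests: {t30:.4f}s")
```

The GPU meanwhile sits idle waiting for the mask, which would match the observed ~10% CUDA utilisation.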
Is this likely to be fixable or is it too complex? Thanks!
I think the correct way to approach this would probably be to use some multiprocessing / queue setup, but it would have to be deeply integrated with exllamav2.
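As a rough illustration of that direction (this is a hypothetical sketch, not an exllamav2 or TabbyAPI API): fan the per-request filtering work out to a pool of worker processes, each with its own interpreter and its own GIL, so the main process stays free to drive CUDA. The `allowed_mask` / `filter_batch` names and the vocabulary size are assumptions for the example; a real fix would need the mask results fed back into the sampling loop, which is where the deep integration comes in.

```python
from concurrent.futures import ProcessPoolExecutor

VOCAB_SIZE = 128_256  # approximate Llama 3.1 vocabulary size

def allowed_mask(allowed_ids):
    # The same CPU-bound mask construction as before, but now run in a
    # worker process, so it no longer contends for the main GIL.
    mask = [False] * VOCAB_SIZE
    for tid in allowed_ids:
        mask[tid] = True
    return mask

def filter_batch(batch_allowed_ids, max_workers=8):
    # Hypothetical batch step: one filtering task per in-flight request,
    # distributed across processes instead of threads.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(allowed_mask, batch_allowed_ids))

if __name__ == "__main__":
    # 30 simulated concurrent requests, each with a different allowed set.
    batch = [list(range(i, VOCAB_SIZE, 97)) for i in range(30)]
    masks = filter_batch(batch)
    print(len(masks))  # one mask per request
```

The trade-off is serialisation overhead: shipping a 128k-entry mask per token per request between processes is itself costly, so a production version would likely use shared memory or compact index lists rather than pickled Python lists.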