noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex, etc.) of a language model
MIT License

Reduced performance/bottleneck with concurrent requests and Llama-3.1 #127

Open thigger opened 1 month ago

thigger commented 1 month ago

Using TabbyAPI/exllamav2 with Llama-3.1-8B on a Threadripper Pro / A6000 GPU.

Inference runs at ~70 t/s unconstrained with a single request, and ~35 t/s with lm-format-enforcer (JSON schema). With 30 simultaneous requests, performance drops to ~1-2 t/s per request, with CUDA utilisation at ~10%. This does not occur when lm-format-enforcer is not used (90-100% CUDA utilisation, with 10-20 t/s on each request).

@turboderp has been able to replicate this and suggests it is due to Llama-3.1's large vocabulary combined with the GIL forcing single-threaded behaviour.
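The effect can be illustrated with a minimal, self-contained sketch (not lm-format-enforcer's actual code): a pure-Python, CPU-bound pass over a Llama-3.1-sized vocabulary holds the GIL for its whole duration, so fanning the per-request mask computation out across threads gives essentially no speedup. The `compute_allowed_mask` function and its membership check are placeholders standing in for the real per-token filtering.

```python
import time
from concurrent.futures import ThreadPoolExecutor

VOCAB_SIZE = 128_256  # roughly Llama-3.1's vocabulary size

def compute_allowed_mask(step: int) -> int:
    # Placeholder for per-token filtering: a pure-Python loop over the
    # vocabulary that holds the GIL for its entire duration.
    allowed = 0
    for token_id in range(VOCAB_SIZE):
        if (token_id + step) % 7 == 0:  # stand-in "is this token allowed" check
            allowed += 1
    return allowed

def run(n_requests: int, workers: int) -> float:
    # Time n_requests mask computations spread across `workers` threads.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(compute_allowed_mask, range(n_requests)))
    return time.perf_counter() - start

if __name__ == "__main__":
    # Because each call holds the GIL, extra threads barely help; wall time
    # stays roughly constant while CUDA sits idle waiting for the masks.
    print(f"1 worker:  {run(8, workers=1):.2f}s")
    print(f"8 workers: {run(8, workers=8):.2f}s")
```

With 30 concurrent requests, each generation step needs 30 of these GIL-bound passes before the GPU can sample, which matches the observed ~10% CUDA utilisation.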

Is this likely to be fixable or is it too complex? Thanks!

noamgat commented 1 week ago

I think the correct way to approach this would probably be to use some multiprocessing / queue setup, but it would have to be deeply integrated with exllamav2.
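A rough sketch of what such a setup could look like, using a process pool so each worker has its own interpreter and GIL (this is an illustration of the general approach, not a proposed API for lm-format-enforcer or exllamav2; `compute_allowed_mask` and `masks_for_batch` are hypothetical names, and the filtering check is a placeholder):

```python
from concurrent.futures import ProcessPoolExecutor

VOCAB_SIZE = 128_256  # roughly Llama-3.1's vocabulary size

def compute_allowed_mask(args):
    # Hypothetical stand-in for one request's per-token filtering. It runs
    # in a worker process, so it never contends for the parent's GIL.
    request_id, step = args
    allowed = sum(
        1 for token_id in range(VOCAB_SIZE) if (token_id + step) % 7 == 0
    )
    return request_id, allowed

def masks_for_batch(pool, request_ids, step):
    # One mask computation per in-flight request, fanned out across worker
    # processes; the engine would apply these masks before sampling the
    # next token for each request.
    return dict(pool.map(compute_allowed_mask, [(r, step) for r in request_ids]))

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        masks = masks_for_batch(pool, request_ids=range(8), step=0)
        print(len(masks))
```

The hard part, as noted, is the integration: the inference engine would have to ship the decoding state to the workers each step (or keep per-request parser state resident in a pinned worker), which is why this needs to be done inside exllamav2 rather than bolted on from outside.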