huggingface / optimum-benchmark

🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.

Possibility to add multiple users / concurrent user requests? #222

Open mgiessing opened 3 weeks ago

mgiessing commented 3 weeks ago

Hi there :-)

Is there a way to configure multiple users / concurrent request sessions? I'd like to simulate how the different backends behave when not just 1 user but, e.g., 8 users access the LLM concurrently.

I know batches can already be configured, but there should be a performance difference between 1 user sending a single batch of 8 requests and 8 users each independently sending a batch of 1 request. Please correct me if that is not true :-)
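To make the distinction concrete, here is a minimal, purely illustrative asyncio sketch of the two load patterns; `send_request` is a placeholder for a backend call, not an optimum-benchmark API:

```python
import asyncio
import time

async def send_request(prompts: list[str]) -> None:
    # Placeholder for a real backend/server call; the sleep stands in for
    # server-side work that grows with the number of prompts in the request.
    await asyncio.sleep(0.1 * len(prompts))

async def one_user_batch_of_8() -> None:
    # One user submits a single request containing a batch of 8 prompts.
    await send_request([f"prompt {i}" for i in range(8)])

async def eight_users_one_request_each() -> None:
    # Eight independent users each submit a batch of 1 prompt concurrently,
    # so requests arrive separately and may be scheduled/batched differently
    # by the server.
    await asyncio.gather(*(send_request([f"prompt {i}"]) for i in range(8)))

if __name__ == "__main__":
    for scenario in (one_user_batch_of_8, eight_users_one_request_each):
        start = time.perf_counter()
        asyncio.run(scenario())
        print(scenario.__name__, f"{time.perf_counter() - start:.3f}s")
```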

Thanks a lot, and I appreciate the work on optimum-benchmark!

IlyasMoutawwakil commented 3 weeks ago

Yes, that's possible. It will have to be integrated at the backend level, but if you look at the py-txi backend, for example, you'll see that it has an async method (which is converted into a sync one for our batched inference scenario). That method can be reused by a scenario that specifically targets server-like concurrency and is configured with the number of concurrent users instead of a batch size, etc.
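Roughly, the pattern looks like this (a simplified sketch, not the actual py-txi backend code; the class and method names here are placeholders):

```python
import asyncio

class PyTXILikeBackend:
    async def async_generate(self, prompts: list[str]) -> list[str]:
        # Placeholder: a real backend would send the prompts to the
        # inference server and await the responses.
        await asyncio.sleep(0.01)
        return [f"generated for: {p}" for p in prompts]

    def generate(self, prompts: list[str]) -> list[str]:
        # Synchronous wrapper as used by a batched inference scenario;
        # a concurrency-oriented scenario could await async_generate directly.
        return asyncio.run(self.async_generate(prompts))
```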

Overall this will mostly require an InferenceServerScenario that implements the logic, plus some async methods (async_forward, async_generate, etc.) in the backends you want to target.
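As a rough sketch of what that could look like (the class, its config fields, and the backend's `async_generate` method are assumptions for illustration, not existing optimum-benchmark APIs):

```python
import asyncio
import time
from dataclasses import dataclass

@dataclass
class InferenceServerConfig:
    num_users: int = 8          # concurrent users instead of a batch size
    requests_per_user: int = 4  # requests each user sends sequentially

class InferenceServerScenario:
    def __init__(self, config: InferenceServerConfig) -> None:
        self.config = config

    async def _user_session(self, backend, user_id: int) -> list[float]:
        # Each simulated user sends its requests one after another,
        # while all users run concurrently against the backend.
        latencies = []
        for i in range(self.config.requests_per_user):
            start = time.perf_counter()
            await backend.async_generate([f"user {user_id} request {i}"])
            latencies.append(time.perf_counter() - start)
        return latencies

    def run(self, backend) -> list[list[float]]:
        async def _run_all():
            return await asyncio.gather(
                *(self._user_session(backend, u) for u in range(self.config.num_users))
            )
        return asyncio.run(_run_all())
```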

I have already discussed this with @mht-sharma and it could be a great feature to compare server backends (TGI, vLLM, TRT-LLM) more adequately.

Would love to review a PR if this interests you.