huggingface / optimum-benchmark

🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.

Possibility to add multiple users / concurrent user requests? #222

Open mgiessing opened 3 weeks ago

mgiessing commented 3 weeks ago

Hi there :-)

Is there a way to configure multiple users / concurrent request sessions? I'd like to simulate how the different backends behave when not just 1 user but, e.g., 8 users access the LLM concurrently.

I know batches can already be configured, but there should be a performance difference between 1 user sending a single batch of 8 requests and 8 users each independently sending a batch of 1 request. Please correct me if that is not true :-)
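To make the distinction concrete, here is a minimal, purely illustrative asyncio sketch of the two load patterns; `send_request` is a placeholder for a backend call, not an optimum-benchmark API:

```python
import asyncio
import time

async def send_request(prompts: list[str]) -> None:
    # Placeholder for a real backend/server call; the sleep stands in for
    # server-side work that grows with the number of prompts in the request.
    await asyncio.sleep(0.1 * len(prompts))

async def one_user_batch_of_8() -> None:
    # One user submits a single request containing a batch of 8 prompts.
    await send_request([f"prompt {i}" for i in range(8)])

async def eight_users_one_request_each() -> None:
    # Eight independent users each submit a batch of 1 prompt concurrently,
    # so requests arrive separately and may be scheduled/batched differently
    # by the server.
    await asyncio.gather(*(send_request([f"prompt {i}"]) for i in range(8)))

if __name__ == "__main__":
    for scenario in (one_user_batch_of_8, eight_users_one_request_each):
        start = time.perf_counter()
        asyncio.run(scenario())
        print(scenario.__name__, f"{time.perf_counter() - start:.3f}s")
```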

Thanks a lot, and I appreciate the work on optimum-benchmark!

IlyasMoutawwakil commented 3 weeks ago

Yes, that's possible. It will have to be integrated at the backend level, but if you look at the py-txi backend, for example, you'll see that it has an async method (which is converted into a sync one for our batched inference scenario). That method can be reused by a scenario that specifically targets server-like concurrency and is configured with the number of concurrent users instead of a batch size, etc.
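Roughly, the pattern looks like this (a simplified sketch, not the actual py-txi backend code; the class and method names here are placeholders):

```python
import asyncio

class PyTXILikeBackend:
    async def async_generate(self, prompts: list[str]) -> list[str]:
        # Placeholder: a real backend would send the prompts to the
        # inference server and await the responses.
        await asyncio.sleep(0.01)
        return [f"generated for: {p}" for p in prompts]

    def generate(self, prompts: list[str]) -> list[str]:
        # Synchronous wrapper as used by a batched inference scenario;
        # a concurrency-oriented scenario could await async_generate directly.
        return asyncio.run(self.async_generate(prompts))
```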

Overall this will mostly require an InferenceServerScenario that implements the logic, plus some async methods (async_forward, async_generate, etc.) in the backends you want to target.
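As a rough sketch of what that could look like (the class, its config fields, and the backend's `async_generate` method are assumptions for illustration, not existing optimum-benchmark APIs):

```python
import asyncio
import time
from dataclasses import dataclass

@dataclass
class InferenceServerConfig:
    num_users: int = 8          # concurrent users instead of a batch size
    requests_per_user: int = 4  # requests each user sends sequentially

class InferenceServerScenario:
    def __init__(self, config: InferenceServerConfig) -> None:
        self.config = config

    async def _user_session(self, backend, user_id: int) -> list[float]:
        # Each simulated user sends its requests one after another,
        # while all users run concurrently against the backend.
        latencies = []
        for i in range(self.config.requests_per_user):
            start = time.perf_counter()
            await backend.async_generate([f"user {user_id} request {i}"])
            latencies.append(time.perf_counter() - start)
        return latencies

    def run(self, backend) -> list[list[float]]:
        async def _run_all():
            return await asyncio.gather(
                *(self._user_session(backend, u) for u in range(self.config.num_users))
            )
        return asyncio.run(_run_all())
```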

I have already discussed this with @mht-sharma and it could be a great feature to compare server backends (TGI, vLLM, TRT-LLM) more adequately.

Would love to review a PR if this interests you.