triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Stress testing, concurrent testing #248

Open SDcodehub opened 9 months ago

SDcodehub commented 9 months ago

I have a few models with different options, all built with TensorRT-LLM.

I want to stress test these using Triton.

At present I am using the CI script provided in the Triton repo; however, it sends requests one after another.

I want to test how it performs under concurrent load. Do we have any scripts to run these tests?

juney-nvidia commented 9 months ago

@SDcodehub What do you mean by "concurrent testing" here?

Can you elaborate on your requirement?

Thanks,
June

ekarmazin commented 9 months ago

Basically we have the same question: we need to validate performance based on the GPUs available in production so we can plan our scaling accordingly. As @SDcodehub mentioned, we are also looking for the best way to issue concurrent requests instead of the sequential ones sent by the example script. Any recommendations or suggestions? Currently we are using a custom Python script with Locust, but that does not give us the full picture we are looking for.
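
For context, our current Locust file is roughly the sketch below. The model name "ensemble", the endpoint path, and the payload fields are placeholders for whatever your TensorRT-LLM deployment exposes through Triton's HTTP generate endpoint, so adjust them to your setup.

```python
from locust import HttpUser, task, between


class TritonGenerateUser(HttpUser):
    # Keep the wait short so each simulated user applies near-continuous load.
    wait_time = between(0.0, 0.1)

    @task
    def generate(self):
        # Placeholder payload; adjust the fields to match your ensemble's inputs.
        payload = {
            "text_input": "What is machine learning?",
            "max_tokens": 128,
        }
        # Placeholder model name; POST to Triton's HTTP generate endpoint.
        self.client.post("/v2/models/ensemble/generate", json=payload)
```

We then drive it with something like `locust -f locustfile.py --host http://localhost:8000 --users 64 --spawn-rate 8 --headless --run-time 5m`, where `--users` sets the number of concurrent clients.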

SDcodehub commented 9 months ago

@ekarmazin Agreed, we also tried Locust. It is good to some extent, but I am not sure it covers the full picture.

Any reading material in this direction would also help.

schetlur-nv commented 8 months ago

Have you tried using https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/tools/inflight_batcher_llm/end_to_end_test.py? This has a 'concurrency' parameter that allows you to issue multiple requests in parallel.
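Conceptually, the concurrency setting just keeps several requests in flight at the same time. If you want a quick standalone sanity check of that behaviour outside the script, a rough sketch like the following does the same thing on a small scale (the model name "ensemble", the endpoint path, and the payload fields are placeholders for your deployment):

```python
import concurrent.futures
import time

import requests

# Placeholder endpoint and model name; adjust to your deployment.
TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"


def send_request(prompt: str) -> float:
    """Send one generate request and return its wall-clock latency in seconds."""
    payload = {"text_input": prompt, "max_tokens": 128}
    start = time.perf_counter()
    resp = requests.post(TRITON_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return time.perf_counter() - start


prompts = ["What is machine learning?"] * 64
# max_workers controls how many requests are in flight at once.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    latencies = list(pool.map(send_request, prompts))

print(f"{len(latencies)} requests, mean latency {sum(latencies) / len(latencies):.3f}s")
```

From the collected latencies and the total wall-clock time you can derive throughput at a given concurrency level.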

SDcodehub commented 8 months ago

@schetlur-nv Thanks for the reply. Do we have any documentation around it? I have a basic understanding of testing API responses for latency and throughput, so I would rely on other experts here: does this script cover all the testing that is required?