Open tatsianaDr opened 3 months ago
@tatsianaDr Thanks for reaching out. I would like to get more information. Some questions I have are as follows:
Hi @pvijayakrish
I have metrics collected from AWS:
And the configuration files are the same for all models:

```
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
version_policy: { all {} }
parameters { key: "enable_mem_arena" value: { string_value: "0" } }
parameters { key: "enable_mem_pattern" value: { string_value: "0" } }
max_batch_size: 15
```

(`max_batch_size` is 15 or 1, depending on the model.)
There are no ensemble models and no dependencies. It looks like the issue is request rate: I have not found a way to set a rate limit on the Triton side. Each model receives about 3 requests per second, and about 16 models are being tested at the same time.
As for isolating requests: I run all requests concurrently to load-test the service. We have to be sure it will be safe for a massive number of users.
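The concurrent test follows roughly this pattern (a sketch; `infer_once` is a placeholder for the real HTTP/gRPC call to Triton, and the model names are illustrative):

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MODELS = [f"model_{i}" for i in range(16)]  # ~16 models tested at once

def infer_once(model: str) -> str:
    """Placeholder for one inference request to `model`.

    In the real test this is an HTTP/gRPC call to the Triton server.
    """
    time.sleep(0.01)  # simulate network + inference latency
    return f"{model}: ok"

def run_load_test(num_users: int, requests_per_user: int) -> list[str]:
    """Each 'user' fires requests at randomly chosen models, all users concurrent."""
    results = []
    with ThreadPoolExecutor(max_workers=num_users) as pool:
        futures = [
            pool.submit(infer_once, random.choice(MODELS))
            for _ in range(num_users * requests_per_user)
        ]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results

# ~20 concurrent users, as in the failing load test.
results = run_load_test(num_users=20, requests_per_user=5)
```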
Description The Triton Inference Server is deployed on a CPU-only machine. There are about 32 models (onnxruntime).
The Triton Inference Server goes down during long load testing: it stops responding after 5-10 minutes. The load test is about 20 concurrent users across the full service, sending requests to different models. It is not a resource issue: CPU and memory usage are not high. There is no information in the Triton logs (verbose level 2); the server simply stops responding. The /ready endpoint also returns no response.
Triton Information nvcr.io/nvidia/tritonserver:24.01-py3
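To pin down exactly when the hang begins during the load test, a small watchdog polling the KServe v2 readiness route (`/v2/health/ready`, which Triton serves on its HTTP port) can timestamp the first failure. A sketch, assuming the default HTTP port 8000; the URL and interval are placeholders for your deployment:

```python
import time
import urllib.error
import urllib.request

def is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, reset, or timed out: treat as not ready.
        return False

def watch(url: str = "http://localhost:8000/v2/health/ready",
          interval: float = 5.0) -> None:
    """Poll readiness forever, logging a timestamp whenever the probe fails."""
    while True:
        if not is_ready(url):
            print(f"{time.strftime('%H:%M:%S')} server not ready / not responding")
        time.sleep(interval)
```

Correlating that timestamp with the load-test timeline (and with `nvidia-smi`/`top` samples) may show whether the hang coincides with a particular request burst.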