triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Triton Inference Server outage #7038

Open tatsianaDr opened 3 months ago

tatsianaDr commented 3 months ago

Description: The Triton Inference Server is deployed on a CPU-only machine, serving about 32 models (onnxruntime).

The Triton Inference Server goes down during long load testing: it stops responding after 5-10 minutes. The load test runs about 20 concurrent users against the full service, with users sending requests to different models. It does not appear to be a resource issue - CPU and memory usage stay low. There is no information in the Triton logs (verbose level 2); the server simply stops responding, and the /ready endpoint no longer returns any response either.

Triton Information: nvcr.io/nvidia/tritonserver:24.01-py3

pvijayakrish commented 3 months ago

@tatsianaDr Thanks for reaching out. I would like to get more information. Some questions I have are as follows:

tatsianaDr commented 3 months ago

Hi @pvijayakrish, I have metrics collected from AWS: (screenshot attached)

And the configuration files are the same for all models; the relevant part of config.pbtxt is shown below (max_batch_size is 15 or 1, depending on the model):
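```
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
version_policy: { all {} }
parameters {
  key: "enable_mem_arena"
  value: { string_value: "0" }
}
parameters {
  key: "enable_mem_pattern"
  value: { string_value: "0" }
}
max_batch_size: 15
```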

There are no ensemble models and no dependencies. It looks like the issue is related to the request rate, but I have not found a way to set a rate limit on the Triton side. The load is about 3 requests per second per model, with about 16 models being tested at the same time.
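(For reference, Triton does expose a scheduler-level rate limiter: it is enabled by starting the server with --rate-limit=execution_count and configured per instance group in config.pbtxt. A rough sketch, with an illustrative resource name and count - check the rate limiter section of the model configuration docs for the exact semantics:)

```
instance_group [
  {
    count: 1
    kind: KIND_CPU
    rate_limiter {
      # "inference_slot" and its count are placeholders; an instance is only
      # scheduled when the requested resources are available, which limits how
      # many model executions can be in flight at once.
      resources [
        {
          name: "inference_slot"
          count: 1
        }
      ]
      priority: 1
    }
  }
]
```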

And regarding isolating requests: I run all requests in concurrent mode to load test the service, since we have to be sure it will be safe for a very large number of users.