jsoto-gladia opened 1 year ago
Thank you @jsoto-gladia for reporting this issue. I filed a ticket for our team to investigate.
@oandreeva-nv, what is the number of the ticket you opened? Can you share more about why we are filing a ticket for this? Triton has no way of knowing that one model is set up to call another model; BLS is meant to cover these kinds of use cases.
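For readers less familiar with BLS, here is a minimal sketch of what such a model-to-model call looks like inside a Python backend model. The model and tensor names (`model_a`, `INPUT0`, `OUTPUT0`) are placeholders, not from this issue; the `has_error()` branch is where the shutdown error described here surfaces:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Forward the incoming tensor to the other model via BLS.
            # "model_a" / "INPUT0" / "OUTPUT0" are placeholder names.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            infer_request = pb_utils.InferenceRequest(
                model_name="model_a",
                requested_output_names=["OUTPUT0"],
                inputs=[input_tensor],
            )
            infer_response = infer_request.exec()

            # During a graceful shutdown, model_a may already be unloaded,
            # in which case the BLS call comes back with an error.
            if infer_response.has_error():
                responses.append(
                    pb_utils.InferenceResponse(
                        output_tensors=[],
                        error=pb_utils.TritonError(
                            infer_response.error().message()
                        ),
                    )
                )
                continue

            output_tensor = pb_utils.get_output_tensor_by_name(
                infer_response, "OUTPUT0"
            )
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output_tensor])
            )
        return responses
```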
Or is this more of an enhancement request: allowing users to declare in a model's configuration which other models it depends on, so that Triton knows not to unload those models until the dependent models have no more requests in flight?
I believe this issue asks us to make sure that, during a graceful shutdown of Triton Inference Server, we properly handle in-flight requests, i.e. instead of returning a generic error to the client, we report that the server was shut down during the request. A similar issue was opened previously and marked as a feature request. This issue relates to this ticket: https://jirasw.nvidia.com/browse/DLIS-5458, which is linked to the feature request ticket.
Thanks for providing the number! I see. For some reason, I thought the expected behavior was to start unloading each model as soon as it had no ongoing requests. It sounds like Triton instead keeps models loaded until there are no in-flight requests at all, or until the shutdown timeout is reached. I appreciate you clarifying!
Issue Description: During a graceful shutdown of Triton Server, we've observed the following behavior:
Triton Server is hosting both Model A and Model B.
Model B can make calls to Model A.
While an in-flight inference request for Model B is still executing, Triton Server shuts down Model A.
An error is produced when Model B tries to call Model A, which is now in a not-ready state.
Expected Behavior: During a graceful shutdown, Triton Server should handle in-flight inference requests for Model B gracefully and not prematurely shut down Model A.
Steps to Reproduce:
1. Start Triton Server with Model A and Model B loaded, where Model B calls Model A (e.g. via BLS).
2. Send an inference request to Model B.
3. While that request is still in flight, initiate a graceful shutdown of the server (a client-side sketch follows below).
Actual Results: Model A is shut down prematurely, leading to errors when Model B tries to make further calls to it.
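A rough client-side sketch of the reproduction, under stated assumptions: HTTP endpoint on localhost:8000, placeholder model/tensor names (`model_b`, `INPUT0`, `OUTPUT0`), a known tritonserver PID, and SIGTERM as the graceful-shutdown trigger:

```python
import os
import signal
import threading

import numpy as np
import tritonclient.http as httpclient

TRITONSERVER_PID = 12345  # placeholder: PID of the running tritonserver

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a request for Model B (tensor name/shape/dtype are assumptions).
data = np.zeros((1, 16), dtype=np.float32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Trigger a graceful shutdown shortly after the request is sent, so the
# request is still in flight when SIGTERM arrives.
threading.Timer(0.1, os.kill, args=(TRITONSERVER_PID, signal.SIGTERM)).start()

try:
    result = client.infer("model_b", inputs=[infer_input])
    print("OK:", result.as_numpy("OUTPUT0"))
except Exception as e:
    # With the behavior described above, this fails because Model A
    # was unloaded while Model B's request was still executing.
    print("Inference failed during shutdown:", e)
```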