jsoto-gladia opened 1 year ago
Thank you @jsoto-gladia for reporting this issue. I filed a ticket for our team to investigate.
@oandreeva-nv, what is the number of the ticket you opened? Can you share more about why we are filing a ticket for this? Triton has no way of knowing that one model is set up to call another model; BLS is meant to cover these kinds of use cases.
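For readers less familiar with BLS, here is a minimal sketch of what such a model-to-model call looks like inside a Python backend model. The model and tensor names (`model_a`, `INPUT0`, `OUTPUT0`) are placeholders, not from this issue; the `has_error()` branch is where the shutdown error described here surfaces:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Forward the incoming tensor to the other model via BLS.
            # "model_a" / "INPUT0" / "OUTPUT0" are placeholder names.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            infer_request = pb_utils.InferenceRequest(
                model_name="model_a",
                requested_output_names=["OUTPUT0"],
                inputs=[input_tensor],
            )
            infer_response = infer_request.exec()

            # During a graceful shutdown, model_a may already be unloaded,
            # in which case the BLS call comes back with an error.
            if infer_response.has_error():
                responses.append(
                    pb_utils.InferenceResponse(
                        output_tensors=[],
                        error=pb_utils.TritonError(
                            infer_response.error().message()
                        ),
                    )
                )
                continue

            output_tensor = pb_utils.get_output_tensor_by_name(
                infer_response, "OUTPUT0"
            )
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output_tensor])
            )
        return responses
```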
Or is this more of an enhancement request: allowing users to declare in a model's configuration which other models it depends on, so that Triton knows not to unload those models until the dependent models have no more requests in flight?
I believe this issue asks us to make sure that, during a graceful shutdown of Triton Inference Server, we properly handle in-flight requests, i.e. instead of returning a generic error to the client, we report that the server was shut down during the request. A similar issue was opened previously and marked as a feature request. This issue relates to this ticket: https://jirasw.nvidia.com/browse/DLIS-5458, which is linked to the feature request ticket.
Thanks for providing the number! I see. For some reason, I thought the expected behavior was to start unloading each model as soon as it had no ongoing requests. It sounds like Triton instead keeps models loaded until there are no in-flight requests at all, or until the shutdown timeout is reached. I appreciate you clarifying!
Issue Description: During a graceful shutdown of Triton Server, we've observed the following behavior:
Triton Server is hosting both Model A and Model B.
Model B can make calls to Model A.
While an in-flight inference request for Model B is still executing, Triton Server shuts down Model A.
An error is produced when Model B tries to call Model A, which is now in a not-ready state.
Expected Behavior: During a graceful shutdown, Triton Server should handle in-flight inference requests for Model B gracefully and not prematurely shut down Model A.
Steps to Reproduce:
1. Start Triton Server with Model A and Model B loaded, where Model B calls Model A (e.g. via BLS).
2. Send an inference request to Model B.
3. While that request is still in flight, initiate a graceful shutdown of the server (a client-side sketch follows below).
Actual Results: Model A is shut down prematurely, leading to errors when Model B tries to make further calls to it.
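A rough client-side sketch of the reproduction, under stated assumptions: HTTP endpoint on localhost:8000, placeholder model/tensor names (`model_b`, `INPUT0`, `OUTPUT0`), a known tritonserver PID, and SIGTERM as the graceful-shutdown trigger:

```python
import os
import signal
import threading

import numpy as np
import tritonclient.http as httpclient

TRITONSERVER_PID = 12345  # placeholder: PID of the running tritonserver

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a request for Model B (tensor name/shape/dtype are assumptions).
data = np.zeros((1, 16), dtype=np.float32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Trigger a graceful shutdown shortly after the request is sent, so the
# request is still in flight when SIGTERM arrives.
threading.Timer(0.1, os.kill, args=(TRITONSERVER_PID, signal.SIGTERM)).start()

try:
    result = client.infer("model_b", inputs=[infer_input])
    print("OK:", result.as_numpy("OUTPUT0"))
except Exception as e:
    # With the behavior described above, this fails because Model A
    # was unloaded while Model B's request was still executing.
    print("Inference failed during shutdown:", e)
```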