triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

`pb_utils.TritonError.NOT_HEALTHY` Error Code #6763

Open khaykingleb opened 9 months ago

khaykingleb commented 9 months ago

Is your feature request related to a problem? Please describe.

The problem arises when a model becomes unhealthy (e.g. it runs out of CPU RAM). None of the currently available error codes (such as `pb_utils.TritonError.UNKNOWN`, `pb_utils.TritonError.INTERNAL`, etc.) specifically identifies this condition. When it occurs, the client receives a `tritonclient.utils.InferenceServerException` such as `[StatusCode.INTERNAL] in ensemble 'some_ensemble', Failed to process the request(s) for model instance 'some_model_0_0', message: Stub process 'some_model_0_0' is not healthy`. Without a specific error code, it is hard to implement efficient error handling, particularly when NVIDIA Triton Inference Server is used with external systems that retry automatically (say, Celery).

Describe the solution you'd like

I propose introducing a new error code: `pb_utils.TritonError.NOT_HEALTHY`. This error code would specifically indicate a problem with the health of a model, such as running out of CPU RAM. With it, I could implement more targeted error handling, such as automatically retrying requests to NVIDIA Triton, knowing that the stub will be reinitialized afterwards. Alternatively, the existing error code `pb_utils.TritonError.UNAVAILABLE` could be raised specifically for model health issues.
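To illustrate the kind of targeted handling this would enable, here is a minimal sketch of a retry predicate keyed on the status code. It assumes the `UNAVAILABLE` variant of the proposal, and that a gRPC client would then see the string `StatusCode.UNAVAILABLE` from the exception's `status()` method; neither is current Triton behavior. `should_retry` and `RETRYABLE_STATUSES` are hypothetical names.

```python
# Assumption: under the proposal, an unhealthy stub would surface to gRPC
# clients with this status string instead of "StatusCode.INTERNAL".
RETRYABLE_STATUSES = {"StatusCode.UNAVAILABLE"}


def should_retry(exc: Exception) -> bool:
    """Return True only for errors whose status marks a transient health issue."""
    # Duck-typed so the sketch does not require tritonclient to be installed;
    # exceptions without a callable status() are never retried.
    status_fn = getattr(exc, "status", None)
    status = status_fn() if callable(status_fn) else None
    return status in RETRYABLE_STATUSES
```

A task queue could call such a predicate in its error handler and re-enqueue the request only when it returns `True`, leaving every other failure to surface immediately.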

Describe alternatives you've considered

The current alternative is to use the existing, more general error codes. However, this approach lacks precision: retrying on a generic code such as `INTERNAL` also retries failures unrelated to model health, resulting in a high rate of false positives.
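A narrower stopgap is to match on the message text rather than the status code. The sketch below assumes the server keeps emitting the phrase `is not healthy` (taken from the error quoted above), which is fragile because message wording is not a stable contract. `call_with_retry`, `is_stub_unhealthy`, and the `infer` callable are hypothetical names, not part of `tritonclient`.

```python
import time

# Assumption: this substring is taken from the error message quoted above and
# may change between Triton releases.
NOT_HEALTHY_MARKER = "is not healthy"


def is_stub_unhealthy(exc: Exception) -> bool:
    """Heuristically detect the 'stub process is not healthy' failure."""
    return NOT_HEALTHY_MARKER in str(exc)


def call_with_retry(infer, max_attempts: int = 3, delay_s: float = 0.0):
    """Call infer(), retrying only when the stub is reported as unhealthy."""
    for attempt in range(1, max_attempts + 1):
        try:
            return infer()
        except Exception as exc:
            # Re-raise unrelated errors at once, and give up after the last attempt.
            if not is_stub_unhealthy(exc) or attempt == max_attempts:
                raise
            time.sleep(delay_s)  # give the stub time to be reinitialized
```

In practice `infer` would wrap the actual client call, for example `lambda: client.infer(model_name, inputs)`, and `delay_s` would be tuned to how long the stub takes to restart.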

Tabrizian commented 9 months ago

@khaykingleb Thanks for your feature request. I think this is a reasonable request. @krishung5 Do you have any thoughts regarding this request?

krishung5 commented 9 months ago

I think that this request will help users to better handle the model health issues. Filed a feature request ticket (DLIS-6039).

sboudouk commented 4 months ago

Hello.

Any news on this feature? I'm encountering issues in production with kube + Triton server, and the "not healthy" log isn't really helping me; I don't know what's wrong.

Thanks :)