triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Server should exit on unrecoverable errors in underlying runtime that cannot be resolved by model reloading #5765

Open tuxedocat opened 1 year ago

tuxedocat commented 1 year ago

Description

The model is not reloaded when the underlying backend runtime (pytorch_backend with libtorch, in this case) raises an error.

In such cases, it would be useful in a production environment if either the server exited on the unrecoverable error (so the orchestrator could restart it), or the affected model were unloaded and reloaded automatically.

Note that in my case the model runs with the PyTorch backend, and the error shown below is caused by a CUDA memory-access error surfacing in libtorch.cc. The root cause of that error, however, is essentially out of scope for this issue.

Here's an actual case that occurred:

Excerpt of the log. First, I got the following error from libtorch.cc, which originates in the underlying CUDA runtime:

Failed to capture elapsed time: Internal - Failed to capture elapsed time: an illegal memory access was encountered

Then, the same log entry repeatedly occurred until the Triton pod on k8s was manually restarted.

NOTES:

TLDR: the server should handle backend-internal errors.

Triton Information

nvcr.io/nvidia/tritonserver:23.03-py3

To Reproduce

This issue may not be specific to the model, but the settings below are the ones used in our case:

Steps to reproduce the behavior:

Expected behavior

In the case above, the desired behavior of the tritonserver would be to "exit if unrecoverable." Then, the liveness probe would detect that the pod is unhealthy, and the pod would be restarted automatically.
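
For illustration, the probe I have in mind is just the standard one against Triton's HTTP health and per-model readiness endpoints. A minimal sketch of an exec-style check is below; the port (8000) and the model name are placeholders for our actual deployment:

```python
# Minimal liveness/readiness check against Triton's standard HTTP endpoints.
# The port and model name are placeholders, not taken from this issue.
import sys
import requests

TRITON_URL = "http://localhost:8000"
MODEL_NAME = "my_model"  # placeholder

def main() -> int:
    try:
        live = requests.get(f"{TRITON_URL}/v2/health/live", timeout=2)
        ready = requests.get(f"{TRITON_URL}/v2/models/{MODEL_NAME}/ready", timeout=2)
    except requests.RequestException as exc:
        print(f"probe failed: {exc}", file=sys.stderr)
        return 1
    if live.status_code != 200 or ready.status_code != 200:
        print(f"unhealthy: live={live.status_code} ready={ready.status_code}",
              file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The problem today is that after the illegal-memory-access error the server keeps running and these endpoints can keep reporting healthy, so a probe like this never fires; if the server exited on the unrecoverable error instead, this standard setup would be enough.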

As a side note, and specific to the PyTorch backend in this case, it seems we need to handle the error inside the backend as well.

seoungbae-park commented 1 year ago

I have the same error on k8s. Any solution?

tuxedocat commented 1 year ago

It seems this issue has been treated as a backend-specific error, but I think it is really a design discussion about error handling in the server:

As I wrote in the TLDR of the description: the server should handle backend-internal errors.

In the meantime, we need to detect the error ourselves somehow, e.g. via the log output, and trigger a restart.
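
Concretely, the stopgap looks something like the sketch below: a thin wrapper that scans tritonserver's output for the fatal message and terminates the process so the pod gets restarted. The matched strings and the way the server is launched here are assumptions about our setup, not anything Triton provides:

```python
# Rough sketch of a log-watching workaround: terminate tritonserver when an
# unrecoverable CUDA error appears in its output, so the orchestrator restarts
# the container. The fatal patterns below are assumptions for this setup.
import subprocess
import sys

FATAL_PATTERNS = (
    "an illegal memory access was encountered",
    "Failed to capture elapsed time",
)

def main() -> int:
    # Launch tritonserver with whatever arguments the deployment normally uses.
    proc = subprocess.Popen(
        ["tritonserver"] + sys.argv[1:],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        sys.stdout.write(line)  # pass the server log through unchanged
        if any(pattern in line for pattern in FATAL_PATTERNS):
            proc.terminate()
            try:
                proc.wait(timeout=30)
            except subprocess.TimeoutExpired:
                proc.kill()
            return 1
    return proc.wait()

if __name__ == "__main__":
    sys.exit(main())
```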

MatthieuToulemont commented 1 year ago

I agree 100%, Triton should handle this.

I am currently having an issue where a CUDA kernel error is triggered in a BLS model that uses torch inside. The model is still considered READY by Triton and can still accept requests, but all subsequent requests time out :/
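
Since READY can't be trusted here, one workaround is a liveness check that runs a tiny real inference with a short timeout instead of only checking readiness. A rough sketch with the Python tritonclient is below; the model name, input name, dtype, and shape are placeholders for whatever is actually deployed:

```python
# Sketch of a liveness check that performs a real (tiny) inference instead of
# relying on Triton's READY status. Model/input names, dtype, and shape are
# placeholders and must match the deployed model.
import sys
import numpy as np
import tritonclient.http as httpclient

TRITON_URL = "localhost:8000"
MODEL_NAME = "my_model"          # placeholder
INPUT_NAME = "INPUT__0"          # placeholder
INPUT_SHAPE = (1, 3, 224, 224)   # placeholder

def main() -> int:
    try:
        client = httpclient.InferenceServerClient(
            url=TRITON_URL, connection_timeout=5.0, network_timeout=10.0
        )
        infer_input = httpclient.InferInput(INPUT_NAME, list(INPUT_SHAPE), "FP32")
        infer_input.set_data_from_numpy(
            np.zeros(INPUT_SHAPE, dtype=np.float32), binary_data=True
        )
        client.infer(MODEL_NAME, inputs=[infer_input])
    except Exception as exc:  # timeouts or inference errors -> unhealthy
        print(f"probe inference failed: {exc}", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```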

@tuxedocat does unloading and then reloading the model fix your issue, or do you need to restart the full server?

tuxedocat commented 1 year ago

@tuxedocat does unloading and then reloading the model fix your issue, or do you need to restart the full server?

Technically yes: reloading the model is sufficient in my case, which uses explicit model-control mode. For production use, however, we would need error handling on both the model-manager side and the server side.
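
For context, the model-manager-side reload I mean is just the explicit model-control API, roughly like the sketch below, assuming the server is started with --model-control-mode=explicit and using a placeholder model name:

```python
# Sketch of the "model-manager side" recovery path: unload and reload a model
# via Triton's model-control API. Requires --model-control-mode=explicit;
# the model name is a placeholder.
import tritonclient.http as httpclient

TRITON_URL = "localhost:8000"
MODEL_NAME = "my_model"  # placeholder

def reload_model() -> None:
    client = httpclient.InferenceServerClient(url=TRITON_URL)
    client.unload_model(MODEL_NAME)
    client.load_model(MODEL_NAME)
    if not client.is_model_ready(MODEL_NAME):
        raise RuntimeError(f"{MODEL_NAME} did not become ready after reload")

if __name__ == "__main__":
    reload_model()
```

This covers the cases where a reload is enough; for the unrecoverable ones this issue is about, the server-side part is still missing.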

fsatka commented 6 months ago

Has anybody solved this? Did upgrading the Triton version help eliminate the error?

yutkin commented 2 weeks ago

We still see this with version 24.07.