triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Server should exit on unrecoverable errors in underlying runtime that cannot be resolved by model reloading #5765

Open tuxedocat opened 1 year ago

tuxedocat commented 1 year ago

Description

The model is not reloaded when the underlying backend runtime (pytorch_backend with libtorch, in this case) raises an error.

In such cases, it would be useful in a production environment if either the server exited on the unrecoverable error (so the orchestrator could restart it), or the affected model were unloaded and reloaded automatically.

Note that in my case the model runs with the PyTorch backend, and the error shown below is caused by a CUDA memory-access error surfacing in libtorch.cc. The root cause of that error, however, is essentially out of scope for this issue.

Here's an actual case that occurred:

Excerpt of the log. First, I got the following error from libtorch.cc, which originates in the underlying CUDA runtime:

Failed to capture elapsed time: Internal - Failed to capture elapsed time: an illegal memory access was encountered

Then, the same log entry repeatedly occurred until the Triton pod on k8s was manually restarted.

NOTES:

TLDR: the server should handle backend-internal errors.

Triton Information

nvcr.io/nvidia/tritonserver:23.03-py3

To Reproduce

This issue may not be specific to the model, but the settings below are the ones used in our case:

Steps to reproduce the behavior:

Expected behavior

In the case above, the desired behavior of the tritonserver would be to "exit if unrecoverable." Then, the liveness probe would detect that the pod is unhealthy, and the pod would be restarted automatically.
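
For illustration, the probe I have in mind is just the standard one against Triton's HTTP health and per-model readiness endpoints. A minimal sketch of an exec-style check is below; the port (8000) and the model name are placeholders for our actual deployment:

```python
# Minimal liveness/readiness check against Triton's standard HTTP endpoints.
# The port and model name are placeholders, not taken from this issue.
import sys
import requests

TRITON_URL = "http://localhost:8000"
MODEL_NAME = "my_model"  # placeholder

def main() -> int:
    try:
        live = requests.get(f"{TRITON_URL}/v2/health/live", timeout=2)
        ready = requests.get(f"{TRITON_URL}/v2/models/{MODEL_NAME}/ready", timeout=2)
    except requests.RequestException as exc:
        print(f"probe failed: {exc}", file=sys.stderr)
        return 1
    if live.status_code != 200 or ready.status_code != 200:
        print(f"unhealthy: live={live.status_code} ready={ready.status_code}",
              file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The problem today is that after the illegal-memory-access error the server keeps running and these endpoints can keep reporting healthy, so a probe like this never fires; if the server exited on the unrecoverable error instead, this standard setup would be enough.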

As a side note, and specific to the PyTorch backend in this case, it seems we need to handle the error inside the backend as well.

seoungbae-park commented 1 year ago

I have the same error on k8s. Any solution?

tuxedocat commented 1 year ago

It seems this issue has been treated as a backend-specific error, but I think it is really a design discussion about error handling in the server:

As I wrote in the TLDR of the description: the server should handle backend-internal errors.

In the meantime, we need to detect the error ourselves somehow, e.g. via the log output, and trigger a restart.
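
Concretely, the stopgap looks something like the sketch below: a thin wrapper that scans tritonserver's output for the fatal message and terminates the process so the pod gets restarted. The matched strings and the way the server is launched here are assumptions about our setup, not anything Triton provides:

```python
# Rough sketch of a log-watching workaround: terminate tritonserver when an
# unrecoverable CUDA error appears in its output, so the orchestrator restarts
# the container. The fatal patterns below are assumptions for this setup.
import subprocess
import sys

FATAL_PATTERNS = (
    "an illegal memory access was encountered",
    "Failed to capture elapsed time",
)

def main() -> int:
    # Launch tritonserver with whatever arguments the deployment normally uses.
    proc = subprocess.Popen(
        ["tritonserver"] + sys.argv[1:],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        sys.stdout.write(line)  # pass the server log through unchanged
        if any(pattern in line for pattern in FATAL_PATTERNS):
            proc.terminate()
            try:
                proc.wait(timeout=30)
            except subprocess.TimeoutExpired:
                proc.kill()
            return 1
    return proc.wait()

if __name__ == "__main__":
    sys.exit(main())
```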

MatthieuToulemont commented 1 year ago

I agree 100%, Triton should handle this.

I am currently having an issue where a CUDA kernel error is triggered in a BLS model that uses torch inside. The model is still considered READY by Triton and can still accept requests, but all subsequent requests time out :/
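
Since READY can't be trusted here, one workaround is a liveness check that runs a tiny real inference with a short timeout instead of only checking readiness. A rough sketch with the Python tritonclient is below; the model name, input name, dtype, and shape are placeholders for whatever is actually deployed:

```python
# Sketch of a liveness check that performs a real (tiny) inference instead of
# relying on Triton's READY status. Model/input names, dtype, and shape are
# placeholders and must match the deployed model.
import sys
import numpy as np
import tritonclient.http as httpclient

TRITON_URL = "localhost:8000"
MODEL_NAME = "my_model"          # placeholder
INPUT_NAME = "INPUT__0"          # placeholder
INPUT_SHAPE = (1, 3, 224, 224)   # placeholder

def main() -> int:
    try:
        client = httpclient.InferenceServerClient(
            url=TRITON_URL, connection_timeout=5.0, network_timeout=10.0
        )
        infer_input = httpclient.InferInput(INPUT_NAME, list(INPUT_SHAPE), "FP32")
        infer_input.set_data_from_numpy(
            np.zeros(INPUT_SHAPE, dtype=np.float32), binary_data=True
        )
        client.infer(MODEL_NAME, inputs=[infer_input])
    except Exception as exc:  # timeouts or inference errors -> unhealthy
        print(f"probe inference failed: {exc}", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```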

@tuxedocat does unloading and then reloading the model fix your issue, or do you need to restart the full server?

tuxedocat commented 1 year ago

@tuxedocat does unloading and then reloading the model fix your issue, or do you need to restart the full server?

Technically yes: reloading the model is sufficient in my case, which uses explicit model-control mode. For production use, however, we would need error handling on both the model-manager side and the server side.
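
For context, the model-manager-side reload I mean is just the explicit model-control API, roughly like the sketch below, assuming the server is started with --model-control-mode=explicit and using a placeholder model name:

```python
# Sketch of the "model-manager side" recovery path: unload and reload a model
# via Triton's model-control API. Requires --model-control-mode=explicit;
# the model name is a placeholder.
import tritonclient.http as httpclient

TRITON_URL = "localhost:8000"
MODEL_NAME = "my_model"  # placeholder

def reload_model() -> None:
    client = httpclient.InferenceServerClient(url=TRITON_URL)
    client.unload_model(MODEL_NAME)
    client.load_model(MODEL_NAME)
    if not client.is_model_ready(MODEL_NAME):
        raise RuntimeError(f"{MODEL_NAME} did not become ready after reload")

if __name__ == "__main__":
    reload_model()
```

This covers the cases where a reload is enough; for the unrecoverable ones this issue is about, the server-side part is still missing.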

fsatka commented 6 months ago

Has anybody solved this? Did upgrading the Triton version help eliminate the error?

yutkin commented 2 weeks ago

We still see this with version 24.07.