triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Model Repository Freeze on Python Model Errors in Polling Mode #6710

Open teith opened 11 months ago

teith commented 11 months ago

Description

Encountered a critical issue with Triton Inference Server in poll mode: the server becomes unresponsive after loading a Python model that contains an error. Specifically, if a Python model fails with an import error (e.g., ModuleNotFoundError: No module named 'transformers'), Triton logs the error and then stops processing any further interactions with the model repository, making it impossible to unload the broken model or to load new models.

Triton Information

Version: tritonserver:23.11-py3, using the Triton container

System Information

Device: MacBook Pro 14" (M2 Pro), macOS Sonoma 14.1

To Reproduce Steps to reproduce the behavior:

1. Start Triton Inference Server with --model-repository=/ops/model_repository --model-control-mode=poll --repository-poll-secs=1 --exit-on-error false.
2. Add a Python model to model_repository that imports a missing module (e.g., transformers).
3. Observe the error in the server logs and the subsequent inability to interact further with the model repository.
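The steps above can be sketched as a minimal failing repository. This is an illustrative setup, not taken from the original report: the paths, model name, and tensor names are placeholders, and it assumes the Triton container does not have transformers installed.

```shell
# Create a minimal Python-backend model whose model.py imports a missing module.
mkdir -p /tmp/model_repository/broken_model/1

# model.py: the top-level import fails with ModuleNotFoundError at load time.
cat > /tmp/model_repository/broken_model/1/model.py <<'EOF'
import transformers  # not installed in the container -> ModuleNotFoundError
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        # Never reached: the import above fails first.
        return [pb_utils.InferenceResponse(output_tensors=[]) for _ in requests]
EOF

# Minimal config for the Python backend (names and shapes are placeholders).
cat > /tmp/model_repository/broken_model/config.pbtxt <<'EOF'
name: "broken_model"
backend: "python"
max_batch_size: 0
input [ { name: "IN" data_type: TYPE_FP32 dims: [ 1 ] } ]
output [ { name: "OUT" data_type: TYPE_FP32 dims: [ 1 ] } ]
EOF

# Then start the server in poll mode (inside the Triton container):
# tritonserver --model-repository=/tmp/model_repository \
#              --model-control-mode=poll --repository-poll-secs=1 \
#              --exit-on-error false
```

Once the poller picks up broken_model and the import fails, the repository stops responding to further changes, per the report above.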

Expected behavior The server is expected to handle such errors gracefully, logging them but still maintaining the ability to manage (load/unload) other models. The polling mechanism should continue to function and allow updates to the model repository without a complete server halt.

D1-3105 commented 11 months ago

+1

teith commented 11 months ago

In addition to the poll-mode issue described above, a similar problem occurs in explicit mode. When using the Python tritonclient to call load_model for a broken model, interaction with the model repository freezes in the same way: the load_model call ends with "TimeoutError: timed out". After that, it becomes impossible to unload the faulty model with unload_model, and new models, even correct ones, cannot be loaded.
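A sketch of the explicit-mode interaction described above, using the HTTP client from the tritonclient package. The URL and model name are placeholders, and it assumes a server started with --model-control-mode=explicit; this is not runnable without a live Triton server, so no output is claimed.

```python
# Sketch: explicit-mode load of a broken Python model via tritonclient.
# Requires `pip install tritonclient[http]` and a running Triton server;
# "broken_model" and the URL below are placeholders.
import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException

client = httpclient.InferenceServerClient(url="localhost:8000")

try:
    # With a model whose import fails, this call is reported to hang
    # and eventually end with "TimeoutError: timed out".
    client.load_model("broken_model")
except (InferenceServerException, TimeoutError, OSError) as exc:
    print(f"load_model failed: {exc}")

# After the failed load, further repository operations freeze as well:
try:
    client.unload_model("broken_model")
except (InferenceServerException, TimeoutError, OSError) as exc:
    print(f"unload_model also failed: {exc}")
```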

denti commented 11 months ago

+1

oandreeva-nv commented 11 months ago

I was able to reproduce this issue on my side. Please note that we don't officially support Mac, but this is reproducible on Linux as well. I've created a ticket for the team. [Bug: 5944]

teith commented 7 months ago

Please fix it 🙏🏻