triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Signal 6 or Signal 11 from python backend. #6800

Open kbegiedza opened 9 months ago

kbegiedza commented 9 months ago

Description

In a k8s cluster I have multiple GPUs and a single Triton server pod serving multiple models, including BLS-based models.

Sometimes, under heavy load, Triton restarts with a Signal 6 or Signal 11 error (trace logs below).
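For reference, signal 6 is SIGABRT (an abort, e.g. from a failed assertion or a glibc heap-corruption check) and signal 11 is SIGSEGV (an invalid memory access). Python's standard library confirms the mapping:

```python
import signal

# Map the raw signal numbers from the crash logs to their POSIX names.
print(signal.Signals(6).name)   # SIGABRT: raised by abort(), e.g. glibc detecting heap corruption
print(signal.Signals(11).name)  # SIGSEGV: invalid memory access
```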

I can observe that right before the crash the server's RAM allocation roughly doubles:

[image: memory usage screenshot]
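To log that doubling with a timestamp just before the crash, a minimal stdlib-only sketch (not part of Triton; `rss_kib` is a hypothetical helper that reads Linux's `/proc/<pid>/status`) could be run alongside the server:

```python
def rss_kib(pid="self"):
    """Return the resident set size of a process in KiB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # field is reported in KiB
    return None

if __name__ == "__main__":
    # Sampling this periodically (e.g. against the tritonserver PID) would show
    # whether the RAM spike always precedes the Signal 6 / Signal 11 restart.
    print(f"current RSS: {rss_kib()} KiB")
```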

Triton Information: nvcr.io/nvidia/tritonserver:23.02-py3

To Reproduce: Unknown.

Full log below: triton-server.log

Expected behavior: Stable execution.

nv-kmcgill53 commented 9 months ago

Hi @kbegiedza, as a preliminary check, can you see if you can replicate this behavior on our latest container nvcr.io/nvidia/tritonserver:23.12-py3? Thanks.
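A standard way to test the suggested container is the usual Triton quickstart invocation (a sketch only: the model repository path is a placeholder, and the port/GPU flags assume the default HTTP/gRPC/metrics setup):

```shell
# Pull the newer release and serve an existing model repository with it.
docker pull nvcr.io/nvidia/tritonserver:23.12-py3
docker run --gpus all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.12-py3 \
  tritonserver --model-repository=/models
```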

Fleyderer commented 9 months ago

We have the same problem with every 23.XX version of Triton Server. An interesting fact is that we get a signal 11 error with one GPU, and sometimes signal 6 with two GPUs.