Open kbegiedza opened 9 months ago
Hi @kbegiedza, as a preliminary check, can you see if you can replicate this behavior on our latest container, nvcr.io/nvidia/tritonserver:23.12-py3? Thanks.
We have the same problem with every 23.XX version of Triton Server. Interestingly, we get a signal 11 error with one GPU, and sometimes signal 6 with 2 GPUs.
Description
I have a k8s cluster with multiple GPUs and a single Triton server pod serving multiple models, including BLS-based models.
Sometimes, under heavy load, Triton restarts with a Signal 6 or Signal 11 error (trace logs below).
I can observe that right before the crash the server allocates 2x the RAM.
Triton Information
nvcr.io/nvidia/tritonserver:23.02-py3
To Reproduce
Unknown.
Full log below: triton-server.log
Expected behavior
Stable execution.
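For anyone reading the crash reports above: assuming the usual Linux signal numbering (the Triton containers are Linux-based), Signal 6 is SIGABRT and Signal 11 is SIGSEGV. A minimal Python sketch to decode the numbers is below; `describe_signal` is a hypothetical helper name used only for illustration, not part of Triton.

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name for a signal number on the current platform.

    Assumption: run on Linux, where 6 and 11 have their conventional meaning.
    Raises ValueError if the number is not a valid signal on this platform.
    """
    return signal.Signals(signum).name


if __name__ == "__main__":
    # The two signal numbers reported in this issue.
    for signum in (6, 11):
        print(signum, describe_signal(signum))
```

SIGABRT usually means the process aborted itself (for example via a failed assertion or an uncaught C++ exception), while SIGSEGV indicates an invalid memory access, so the two crashes may or may not share a root cause.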