triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Python model times out after a few hours of successful requests. #6067

Closed MatthieuToulemont closed 1 year ago

MatthieuToulemont commented 1 year ago

Description

My issue is fairly similar to this one.

After a few hours of consecutive successful inferences, one of my Python models (always the same one) starts timing out, and a drop in GPU memory usage is observed. Unloading and reloading the model solves the issue. I don't see any CUDA errors prior to this. The model is still considered healthy and keeps receiving requests, even though it is in no state to process them and returns only timeouts.
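For context, the unload/reload workaround can be driven through Triton's model-control API. A minimal sketch, assuming the server runs with --model-control-mode=explicit and using "my_python_model" as a placeholder for the affected model:

```python
# Minimal sketch of the unload/reload workaround via the model-control API.
# Assumes the server was started with --model-control-mode=explicit;
# "my_python_model" is a placeholder for the affected model name.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Unload the stuck model, then load it again.
client.unload_model("my_python_model")
client.load_model("my_python_model")

# The model should report ready again before traffic resumes.
print("ready:", client.is_model_ready("my_python_model"))
```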

Triton Information

I am using the container nvcr.io/nvidia/tritonserver:22.09-py3. We are currently blocked on this version because all subsequent releases have yielded poorer performance due to TensorRT being slower. 23.06 looks good on that front, but we still have compilation issues, so we are stuck with 22.09 for now.

Are you using the Triton container or did you build it yourself?

To Reproduce

I can't reproduce it at will, but here is my setup:

Expected behavior

At the moment it's hard to understand where the issue is coming from. Clearer errors, or Triton noticing that the model is not healthy, would help.
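Since the readiness endpoint keeps reporting the model as healthy in this state, an end-to-end canary request is one way to surface the hang externally. A rough sketch, where the model name, input name/shape, and the 30 s network timeout are all assumptions rather than values from this issue:

```python
# Sketch of a liveness vs. end-to-end check: readiness may still report True,
# so only a small canary inference catches the "ready but unresponsive" state.
# Model name, input name/shape, and the 30 s timeout are assumptions.
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException

client = httpclient.InferenceServerClient(url="localhost:8000",
                                          network_timeout=30.0)

print("server live:", client.is_server_live())
print("model ready:", client.is_model_ready("my_python_model"))  # may still be True

# Tiny canary request; a timeout or error here is a candidate signal
# for triggering an automated unload/reload.
inp = httpclient.InferInput("IMAGE", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.zeros((1, 3, 224, 224), dtype=np.float32))
try:
    client.infer("my_python_model", inputs=[inp])
    print("canary inference OK")
except InferenceServerException as e:
    print("model unresponsive:", e)
```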

Any advice / clues welcome :D

MatthieuToulemont commented 1 year ago

Issue solved: some of my users were sending images that were too big.
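Given that root cause, one possible mitigation is to reject oversized inputs inside the Python backend model with an explicit error instead of letting them stall processing. A rough sketch only, where the input name "IMAGE", the output name "OUTPUT", and the 4096-pixel limit are illustrative assumptions, not values from this issue:

```python
# model.py sketch for the Python backend: reject oversized images up front
# so they return a clear error instead of stalling the model.
# The "IMAGE"/"OUTPUT" tensor names and the 4096-pixel limit are assumptions.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            image = pb_utils.get_input_tensor_by_name(request, "IMAGE").as_numpy()
            if max(image.shape) > 4096:
                # Return an explicit error for this request instead of hanging.
                responses.append(pb_utils.InferenceResponse(
                    output_tensors=[],
                    error=pb_utils.TritonError(
                        f"input image too large: {image.shape}")))
                continue
            # ... normal processing would go here ...
            out = pb_utils.Tensor("OUTPUT", image)  # placeholder passthrough
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```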