triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Python model times out after a few hours of successful requests. #6067

Closed MatthieuToulemont closed 1 year ago

MatthieuToulemont commented 1 year ago

Description

My issue is fairly similar to this one.

After a few hours of consecutive successful inferences, one of my Python models (always the same one) starts timing out, and a drop in GPU memory usage is observed. Unloading and reloading the model solves the issue. I don't see any CUDA errors prior to this. The model is still considered healthy and keeps receiving requests, even though it is in no state to process them and returns only timeouts.
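For context, the unload/reload workaround can be driven through Triton's model-control API. A minimal sketch, assuming the server runs with --model-control-mode=explicit and using "my_python_model" as a placeholder for the affected model:

```python
# Minimal sketch of the unload/reload workaround via the model-control API.
# Assumes the server was started with --model-control-mode=explicit;
# "my_python_model" is a placeholder for the affected model name.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Unload the stuck model, then load it again.
client.unload_model("my_python_model")
client.load_model("my_python_model")

# The model should report ready again before traffic resumes.
print("ready:", client.is_model_ready("my_python_model"))
```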

Triton Information

I am using the container nvcr.io/nvidia/tritonserver:22.09-py3. We are currently blocked on this version because all subsequent releases have yielded poorer performance due to TensorRT being slower. 23.06 looks good on that front, but we still have compilation issues, so we are stuck with 22.09 for now.

Are you using the Triton container or did you build it yourself?

To Reproduce

I can't reproduce it at will, but here is my setup:

Expected behavior

At the moment it's hard to understand where the issue is coming from. Clearer errors, or Triton noticing that the model is not healthy, would help.
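Since the readiness endpoint keeps reporting the model as healthy in this state, an end-to-end canary request is one way to surface the hang externally. A rough sketch, where the model name, input name/shape, and the 30 s network timeout are all assumptions rather than values from this issue:

```python
# Sketch of a liveness vs. end-to-end check: readiness may still report True,
# so only a small canary inference catches the "ready but unresponsive" state.
# Model name, input name/shape, and the 30 s timeout are assumptions.
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException

client = httpclient.InferenceServerClient(url="localhost:8000",
                                          network_timeout=30.0)

print("server live:", client.is_server_live())
print("model ready:", client.is_model_ready("my_python_model"))  # may still be True

# Tiny canary request; a timeout or error here is a candidate signal
# for triggering an automated unload/reload.
inp = httpclient.InferInput("IMAGE", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.zeros((1, 3, 224, 224), dtype=np.float32))
try:
    client.infer("my_python_model", inputs=[inp])
    print("canary inference OK")
except InferenceServerException as e:
    print("model unresponsive:", e)
```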

Any advice / clues welcome :D

MatthieuToulemont commented 1 year ago

Issue solved: some of my users were sending images that were too big.
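Given that root cause, one possible mitigation is to reject oversized inputs inside the Python backend model with an explicit error instead of letting them stall processing. A rough sketch only, where the input name "IMAGE", the output name "OUTPUT", and the 4096-pixel limit are illustrative assumptions, not values from this issue:

```python
# model.py sketch for the Python backend: reject oversized images up front
# so they return a clear error instead of stalling the model.
# The "IMAGE"/"OUTPUT" tensor names and the 4096-pixel limit are assumptions.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            image = pb_utils.get_input_tensor_by_name(request, "IMAGE").as_numpy()
            if max(image.shape) > 4096:
                # Return an explicit error for this request instead of hanging.
                responses.append(pb_utils.InferenceResponse(
                    output_tensors=[],
                    error=pb_utils.TritonError(
                        f"input image too large: {image.shape}")))
                continue
            # ... normal processing would go here ...
            out = pb_utils.Tensor("OUTPUT", image)  # placeholder passthrough
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```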