triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

RAM memory growth of triton server, until killed by OS #7035

Open InfiniteLife opened 5 months ago

InfiniteLife commented 5 months ago

I'm using the nvcr.io/nvidia/tritonserver:23.10-py3 container for my inferencing, via the C++ gRPC API. There are several models in the container: a YOLOv8-like architecture in TensorRT plus a few TorchScript models. When inferencing, I notice linear growth of the Triton server's RAM consumption, starting from 12-15 GB and growing to 80 GB after 12 hours of constant inferencing, and growing further until, it seems, the process is killed by the OS OOM killer (Ubuntu 22.04).

The Triton Server model load mode is the default (not explicit), no shared memory is used in the inference API, and both sync and async calls are made.
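For reference, the calls are made with the Triton C++ client library, roughly along these lines (a simplified sketch only; the model name, tensor names, and shapes are placeholders rather than my real models, and error checking is omitted):

    // Minimal sketch of the sync/async gRPC inference calls described above.
    #include <memory>
    #include <string>
    #include <vector>
    #include "grpc_client.h"

    namespace tc = triton::client;

    int main() {
      std::unique_ptr<tc::InferenceServerGrpcClient> client;
      tc::InferenceServerGrpcClient::Create(&client, "localhost:8001");

      // Placeholder FP32 input tensor of shape 1x3x640x640.
      std::vector<float> data(1 * 3 * 640 * 640, 0.f);
      tc::InferInput* input;
      tc::InferInput::Create(&input, "images", {1, 3, 640, 640}, "FP32");
      input->AppendRaw(reinterpret_cast<const uint8_t*>(data.data()),
                       data.size() * sizeof(float));
      std::vector<tc::InferInput*> inputs{input};

      tc::InferOptions options("yolov8_trt");  // placeholder model name

      // Synchronous call.
      tc::InferResult* result;
      client->Infer(&result, options, inputs);
      delete result;

      // Asynchronous call; the callback takes ownership of the result.
      client->AsyncInfer(
          [](tc::InferResult* r) { delete r; }, options, inputs);

      delete input;
      return 0;
    }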

My question is: what is a good way to debug this?

pvijayakrish commented 5 months ago

@InfiniteLife Is it possible to run the TensorRT and PyTorch models separately? If the issue goes away for one of them, that will help narrow the problem down to a single backend.
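For example (a sketch only; the model names below are placeholders), you could start the server in explicit model-control mode and load one backend's models at a time:

    # Run 1: only the TensorRT model
    tritonserver --model-repository=/models \
        --model-control-mode=explicit \
        --load-model=yolov8_trt

    # Run 2: only the TorchScript models
    tritonserver --model-repository=/models \
        --model-control-mode=explicit \
        --load-model=torchscript_model_a --load-model=torchscript_model_b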

InfiniteLife commented 5 months ago

My last experiment, with the following parameter set in the config of all TorchScript models:

parameters: {
  key: "ENABLE_CACHE_CLEANING"
  value: {
    string_value: "true"
  }
}

showed no memory growth. It seems like clearing the cache helps. PyTorch is weird.
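For context, this is roughly where the parameter sits in one of the TorchScript models' config.pbtxt (the model name, dims, and tensor names here are placeholders, not my real config):

    name: "torchscript_model_a"
    backend: "pytorch"
    max_batch_size: 8

    input [
      {
        name: "INPUT__0"
        data_type: TYPE_FP32
        dims: [ 3, 640, 640 ]
      }
    ]

    output [
      {
        name: "OUTPUT__0"
        data_type: TYPE_FP32
        dims: [ -1 ]
      }
    ]

    parameters: {
      key: "ENABLE_CACHE_CLEANING"
      value: {
        string_value: "true"
      }
    }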

pvijayakrish commented 5 months ago

@InfiniteLife Could you please share the steps to reproduce the issue?

InfiniteLife commented 5 months ago

I would need to share the models for that, which I cannot do. But as I mentioned, the issue is gone once cache cleaning is enabled for PyTorch.