triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

ORT-TRT backend uses too much CPU memory #7180

Open · ShuaiShao93 opened this issue 4 months ago

ShuaiShao93 commented 4 months ago

Description

When using the ORT-TRT backend on GPU, CPU memory usage is as high as when we run the same model with CPU inference.

Triton Information

What version of Triton are you using? 2.45.0

Are you using the Triton container or did you build it yourself? container

To Reproduce

Expected behavior

CPU memory usage should be low when the model runs on the ORT-TRT backend on GPU.
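
For reference, this is roughly how the ORT-TRT path is enabled through the onnxruntime backend's TensorRT execution accelerator. The model name and parameter values below are illustrative assumptions, not the actual config from this report:

```
# config.pbtxt (sketch): hypothetical ONNX model served by the onnxruntime
# backend with the TensorRT execution accelerator (ORT-TRT) enabled on GPU.
name: "my_onnx_model"
backend: "onnxruntime"
max_batch_size: 8
instance_group [ { kind: KIND_GPU } ]
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    } ]
  }
}
```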

ShuaiShao93 commented 4 months ago

A similar issue was reported before: https://github.com/triton-inference-server/server/issues/5392
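
One way to quantify this while debugging is to sample the resident memory of the server process under each configuration. A minimal sketch, assuming psutil is available on the host and the process is named tritonserver (not part of the original report):

```python
# Sample the resident set size (RSS) of running tritonserver processes to
# compare CPU memory between the ORT-TRT and CPU-only configurations.
import time

import psutil


def tritonserver_rss_mib() -> float:
    """Total RSS in MiB across all processes named 'tritonserver'."""
    total = 0
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] == "tritonserver":
            total += proc.info["memory_info"].rss
    return total / (1024 * 1024)


while True:
    print(f"tritonserver RSS: {tritonserver_rss_mib():.0f} MiB")
    time.sleep(5)
```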