triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

ORT-TRT backend uses too much CPU memory #7180

Open · ShuaiShao93 opened this issue 4 months ago

ShuaiShao93 commented 4 months ago

Description

When using the ORT-TRT backend on GPU, CPU memory usage is as high as when we run the same model with CPU inference.

Triton Information

What version of Triton are you using? 2.45.0

Are you using the Triton container or did you build it yourself? container

To Reproduce

Expected behavior

CPU memory usage should be low when the model runs on the ORT-TRT backend on GPU.
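
For reference, this is roughly how the ORT-TRT path is enabled through the onnxruntime backend's TensorRT execution accelerator. The model name and parameter values below are illustrative assumptions, not the actual config from this report:

```
# config.pbtxt (sketch): hypothetical ONNX model served by the onnxruntime
# backend with the TensorRT execution accelerator (ORT-TRT) enabled on GPU.
name: "my_onnx_model"
backend: "onnxruntime"
max_batch_size: 8
instance_group [ { kind: KIND_GPU } ]
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    } ]
  }
}
```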

ShuaiShao93 commented 4 months ago

A similar issue was reported before: https://github.com/triton-inference-server/server/issues/5392
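
One way to quantify this while debugging is to sample the resident memory of the server process under each configuration. A minimal sketch, assuming psutil is available on the host and the process is named tritonserver (not part of the original report):

```python
# Sample the resident set size (RSS) of running tritonserver processes to
# compare CPU memory between the ORT-TRT and CPU-only configurations.
import time

import psutil


def tritonserver_rss_mib() -> float:
    """Total RSS in MiB across all processes named 'tritonserver'."""
    total = 0
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] == "tritonserver":
            total += proc.info["memory_info"].rss
    return total / (1024 * 1024)


while True:
    print(f"tritonserver RSS: {tritonserver_rss_mib():.0f} MiB")
    time.sleep(5)
```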