triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Raise exception when falling back to pinned memory #5973


david-macleod commented 1 year ago

Is your feature request related to a problem? Please describe.
Triton has a fallback mechanism for writing intermediates to pinned CPU memory when the CUDA memory pool is full: https://github.com/triton-inference-server/core/blob/main/src/memory.cc#L177
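For context, here is a minimal Python sketch of the tiered-allocation pattern being described. This is an illustration only, not Triton's actual C++ implementation; all names are hypothetical.

```python
class Pool:
    """Toy fixed-capacity pool standing in for a real memory pool."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def try_alloc(self, nbytes):
        if self.used + nbytes > self.capacity:
            return None  # pool exhausted
        self.used += nbytes
        return bytearray(nbytes)  # placeholder for a real buffer


def allocate(nbytes, cuda_pool, pinned_pool):
    # Preferred tier: the CUDA memory pool.
    if (buf := cuda_pool.try_alloc(nbytes)) is not None:
        return buf, "GPU"
    # Pool full: silently fall back to pinned host memory. Triton logs a
    # warning at this point; this issue asks for an optional hard error instead.
    if (buf := pinned_pool.try_alloc(nbytes)) is not None:
        return buf, "CPU_PINNED"
    raise MemoryError("both pools exhausted")
```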

When using an ensemble model with large "intermediate" inputs/outputs, triggering this fallback can be catastrophic for performance, so we reserve enough CUDA memory upfront. Additionally, for safety, we monitor the logs for the warning that is emitted when the fallback is triggered.
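For reference, the pool sizes can be fixed at server start with the standard tritonserver flags; below is a sketch launching the server from Python, where the model repository path and byte sizes are placeholders.

```python
import subprocess

# Reserve the memory pools up front at server start. The flags are standard
# tritonserver options; the path and sizes below are placeholders.
cmd = [
    "tritonserver",
    "--model-repository=/models",
    # Format is <gpu-id>:<bytes>. Size the CUDA pool to cover the largest
    # ensemble intermediates so the pinned-memory fallback is never triggered.
    "--cuda-memory-pool-byte-size=0:2147483648",  # 2 GiB on GPU 0
    "--pinned-memory-pool-byte-size=536870912",   # 512 MiB pinned host pool
]
server = subprocess.Popen(cmd)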

Describe the solution you'd like
We would like the option for Triton server to raise an exception, rather than automatically falling back to the next level of the memory hierarchy, to avoid always having to wrap Triton server with log monitoring. This could potentially be a server CLI arg or an environment variable.
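A sketch of what the opt-in could look like from the user's side. Both the environment variable and the CLI flag below are hypothetical, proposed names for this feature request, not existing Triton options.

```python
import os
import subprocess

# HYPOTHETICAL: neither the env var nor the flag exists in Triton today;
# they illustrate the two shapes the feature request proposes.
env = dict(os.environ, TRITON_DISABLE_PINNED_FALLBACK="1")  # env-var form
cmd = [
    "tritonserver",
    "--model-repository=/models",
    "--cuda-memory-pool-byte-size=0:2147483648",
    "--exit-on-pinned-memory-fallback=true",  # CLI-arg form (hypothetical)
]
subprocess.Popen(cmd, env=env)
```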

Describe alternatives you've considered
Continue to monitor logs for the warning.
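The monitoring alternative looks roughly like the sketch below. It assumes Triton writes its log to stderr, and the exact warning text is version-dependent, so the match string here is an assumption; check the warning emitted by memory.cc for the Triton version in use.

```python
import subprocess

# Wrap tritonserver and scan its log stream for the fallback warning.
# ASSUMPTION: the marker substring below may differ between Triton versions.
FALLBACK_MARKER = "falling back to pinned system memory"

proc = subprocess.Popen(
    ["tritonserver", "--model-repository=/models"],
    stderr=subprocess.PIPE,
    text=True,
)
for line in proc.stderr:
    print(line, end="")  # pass logs through
    if FALLBACK_MARKER in line:
        raise RuntimeError("CUDA memory pool exhausted: " + line.strip())
```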

kthui commented 1 year ago

Thanks for the enhancement suggestion. I have filed a ticket for us to investigate further. DLIS-5052

david-macleod commented 5 months ago

Are there any developments here?

If I were to contribute this change, would it be considered? Would an environment variable or a CLI arg be more appropriate for disabling the pinned memory fallback?