Open david-macleod opened 1 year ago
Thanks for the enhancement suggestion. I have filed a ticket for us to investigate further. DLIS-5052
Are there any developments here?
If I were to contribute this change, would it be considered? Would an environment variable or a CLI arg be more appropriate for disabling the pinned memory fallback?
Is your feature request related to a problem? Please describe. Triton has a fallback mechanism for writing intermediates to pinned CPU memory when the CUDA memory pool is full. https://github.com/triton-inference-server/core/blob/main/src/memory.cc#L177
When using an ensemble model with large "intermediate" inputs/outputs, triggering this fallback can be catastrophic for performance, so we ensure enough CUDA memory is reserved upfront. Additionally, for safety, we monitor the logs for the relevant warning, which is raised if the fallback is triggered.
Describe the solution you'd like We would like the option for Triton server to raise an exception, rather than automatically falling back to the next level of the memory hierarchy, to avoid always having to wrap Triton server with log monitoring. This could potentially be a server CLI arg or an environment variable.
Describe alternatives you've considered Continue to monitor logs for the warning
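For context, the log-monitoring workaround looks roughly like the sketch below. It is a minimal illustration, not our exact wrapper: `watch_log_lines`, `PinnedMemoryFallbackError`, and `FALLBACK_PATTERN` are hypothetical names, and the warning text in the pattern is an assumption that should be checked against the message emitted by the Triton version in use.

```python
import re
from typing import Iterable, Iterator, Pattern

# ASSUMPTION: the fallback warning contains this phrase. Verify the exact
# wording against the Triton build you run; it is not guaranteed stable.
FALLBACK_PATTERN = re.compile(r"falling back to pinned system memory", re.IGNORECASE)


class PinnedMemoryFallbackError(RuntimeError):
    """Raised when a log line indicates the pinned-memory fallback was hit."""


def watch_log_lines(
    lines: Iterable[str], pattern: Pattern[str] = FALLBACK_PATTERN
) -> Iterator[str]:
    """Pass log lines through unchanged, failing fast on the fallback warning.

    `lines` can be any iterable of strings, for example the stderr stream
    of a server subprocess opened in text mode.
    """
    for line in lines:
        if pattern.search(line):
            # Surface the offending line so the caller can log or alert on it.
            raise PinnedMemoryFallbackError(line.rstrip())
        yield line
```

In practice the iterable would be the server's log stream, e.g. the `stderr` of a `subprocess.Popen` running `tritonserver`, and the wrapper would terminate or flag the process when the exception is raised. A built-in option would make this wrapper unnecessary.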