triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Python backend SHM memory leak #7481

Open mbahri opened 3 months ago

mbahri commented 3 months ago

Description

I am encountering two possibly related issues with the Python backend and shared memory:

  1. During operation, shared memory usage keeps growing, eventually leading to errors. It looks like the shared memory regions allocated by the Python backend for its inputs are not recycled. I understand the SHM region grows based on the size of the inputs, but this is an issue, especially when multiple model instances are running. It is also possible that the region grows beyond the largest input if memory is leaked instead of reused (see the small monitoring sketch after this list).
  2. After the Triton container is terminated, allocated shared memory regions remain in /dev/shm
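
To make the growth easier to observe, here is a minimal monitoring sketch. It assumes the default /dev/shm mount and guesses that the Python backend's region names contain "python_backend"; inspect /dev/shm on your system and adjust the filter if the names differ.

    # watch_shm.py - periodically report how much of /dev/shm the Python
    # backend's regions occupy. The name filter is an assumption; adjust it
    # to match the region names you actually see under /dev/shm.
    import os
    import time

    SHM_DIR = "/dev/shm"

    while True:
        regions = [
            os.path.join(SHM_DIR, name)
            for name in os.listdir(SHM_DIR)
            if "python_backend" in name  # assumed naming
        ]
        total = 0
        for path in regions:
            try:
                total += os.path.getsize(path)
            except OSError:
                pass  # region was removed between listdir() and stat()
        print(f"{len(regions)} regions, {total / 1e6:.1f} MB total")
        time.sleep(5)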

Triton Information

What version of Triton are you using? 2.47.0

Are you using the Triton container or did you build it yourself? Official containers:

To Reproduce

I encountered the issue with any Python-based model I tried:

Expected behavior

  1. SHM regions would be shrunk, or at least wouldn't grow indefinitely (arena-style allocator?)
  2. SHM regions would be deallocated when the model shuts down
rmccorm4 commented 3 months ago

Hi @mbahri,

Do you have a minimal model, client, and steps you could share for reproducing to help expedite debugging? If it is a generic python backend shm issue, then a simple python model not doing anything interesting (identity, etc.) may be able to reproduce it.
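
For reference, a minimal identity model.py along those lines might look like the sketch below (the tensor names INPUT0/OUTPUT0 are assumptions and must match the model's config.pbtxt):

    import triton_python_backend_utils as pb_utils


    class TritonPythonModel:
        """Minimal identity model: echoes INPUT0 back as OUTPUT0."""

        def execute(self, requests):
            responses = []
            for request in requests:
                # Copy the input tensor straight into the output.
                in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
                out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy())
                responses.append(
                    pb_utils.InferenceResponse(output_tensors=[out0]))
            return responses

Driving such a model with repeated requests of varying sizes while watching /dev/shm should show whether the backend's regions are being reused or keep growing.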

CC @Tabrizian @kthui @krishung5 for viz

rodrigo-orlandini commented 2 months ago

Hi everyone,

@mbahri, has this been solved already? If it has, could you provide an explanation of the solution?

I'm facing a similar problem here. We've already written a GitHub issue describing this problem and a ticket was opened, but it is still occurring and we don't have a solution.

@rmccorm4, there are some steps and metrics that could be used to reproduce and analyse the problem. You can check them here: https://github.com/triton-inference-server/server/issues/6720

fangpings commented 1 month ago

We are facing the same issues in our models. Any more updates on this?

Also, regarding the second issue where /dev/shm is not cleaned after container restarts: if you are in a k8s environment, we've used a hacky way to clean it once the container restarts, so at least the container won't end up in CrashLoopBackOff because it has no shared memory available:

                  "lifecycle": {
                     "postStart": {
                        "exec": {
                           "command": ["/bin/sh", "-c", "rm -f /dev/shm/*"]
                        }
                     }
                  },
ash2703 commented 1 month ago

Facing a similar issue when deploying on k8s: SHM grows and the pod is killed with OOM.

I do not encounter this when testing without k8s.

lakshbhasin commented 2 days ago

Hello @Tabrizian @kthui @krishung5, I have also been running into the same SHM memory leak on Triton 24.04. I noticed it only began when I switched my ensemble model to BLS to add more custom branching. As other commenters have noted, /dev/shm/ fills up and has to be manually cleared between container restarts for me to mitigate the leak.

I have attached a valgrind log file with more details from some warmup requests: triton_valgrind.log

The log file contains many entries specific to BLS (see ExecuteBLSRequest) and shared memory (SaveRequestsToSharedMemory). For example:

==63== Use of uninitialised value of size 8
==63==    at 0x606645F: pthread_cond_broadcast@@GLIBC_2.3.2 (pthread_cond_broadcast.c:76)
==63==    by 0x7F17F29: triton::backend::python::ModelInstanceState::ExecuteBLSRequest(std::shared_ptr<triton::backend::python::IPCMessage>, bool) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F1915C: std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<triton::backend::python::ModelInstanceState::ProcessRequests(TRITONBACKEND_Request**, unsigned int, std::vector<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> >, std::allocator<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> > > >&, bool&)::{lambda()#3}, std::allocator<int>, void ()>::_M_run()::{lambda()#1}, void> >::_M_invoke(std::_Any_data const&) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F22D1C: std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x60674DE: __pthread_once_slow (pthread_once.c:116)
==63==    by 0x7F05378: std::__future_base::_Task_state<triton::backend::python::ModelInstanceState::ProcessRequests(TRITONBACKEND_Request**, unsigned int, std::vector<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> >, std::allocator<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> > > >&, bool&)::{lambda()#3}, std::allocator<int>, void ()>::_M_run() (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F3AED1: boost::asio::detail::executor_op<boost::asio::detail::binder0<std::packaged_task<void ()> >, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F2B91B: boost::asio::detail::scheduler::run(boost::system::error_code&) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F2BE8C: boost::asio::detail::posix_thread::func<boost::asio::thread_pool::thread_function>::run() (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F1F5F3: boost_asio_detail_posix_thread_function (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x605E608: start_thread (pthread_create.c:477)
==63==    by 0x657A352: clone (clone.S:95)
==63==  Uninitialised value was created by a heap allocation
==63==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==63==    by 0x7F16112: triton::backend::python::ModelInstanceState::SaveRequestsToSharedMemory(TRITONBACKEND_Request**, unsigned int, std::vector<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> >, std::allocator<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> > > >&, triton::backend::python::AllocatedSharedMemory<char>&, std::shared_ptr<std::vector<TRITONBACKEND_Response*, std::allocator<TRITONBACKEND_Response*> > >&) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F1B7B5: triton::backend::python::ModelInstanceState::ProcessRequests(TRITONBACKEND_Request**, unsigned int, std::vector<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> >, std::allocator<std::unique_ptr<triton::backend::python::InferRequest, std::default_delete<triton::backend::python::InferRequest> > > >&, bool&) (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x7F1E212: TRITONBACKEND_ModelInstanceExecute (in /opt/tritonserver/backends/python/libtriton_python.so)
==63==    by 0x5140724: triton::core::TritonModelInstance::Execute(std::vector<TRITONBACKEND_Request*, std::allocator<TRITONBACKEND_Request*> >&) (in /opt/tritonserver/lib/libtritonserver.so)
==63==    by 0x51409DA: triton::core::TritonModelInstance::Schedule(std::vector<std::unique_ptr<triton::core::InferenceRequest, std::default_delete<triton::core::InferenceRequest> >, std::allocator<std::unique_ptr<triton::core::InferenceRequest, std::default_delete<triton::core::InferenceRequest> > > >&&) (in /opt/tritonserver/lib/libtritonserver.so)
==63==    by 0x523D57C: triton::core::Payload::Execute(bool*) (in /opt/tritonserver/lib/libtritonserver.so)
==63==    by 0x514493A: triton::core::TritonModelInstance::TritonBackendThread::BackendThread() (in /opt/tritonserver/lib/libtritonserver.so)
==63==    by 0x615F792: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.32)
==63==    by 0x605E608: start_thread (pthread_create.c:477)
==63==    by 0x657A352: clone (clone.S:95)

This is from a run where I loaded the models retrieval_bls, signal_client, and inference_retrieval_ensemble. My BLS logic is essentially to run signal_client and, if it succeeds, run inference_retrieval_ensemble. If signal_client fails, I return early with an empty response.
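
For context, branching of that shape could be written in a BLS model.py roughly as in the sketch below. This is an illustrative sketch, not the actual retrieval_bls implementation: the tensor names and the exact error handling are assumptions.

    import triton_python_backend_utils as pb_utils


    class TritonPythonModel:
        def execute(self, requests):
            responses = []
            for request in requests:
                # "INPUT" / "SIGNAL_OUTPUT" / "RETRIEVAL_OUTPUT" are assumed names.
                in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")

                signal_response = pb_utils.InferenceRequest(
                    model_name="signal_client",
                    requested_output_names=["SIGNAL_OUTPUT"],
                    inputs=[in_tensor],
                ).exec()

                if signal_response.has_error():
                    # Early exit: empty response when signal_client fails.
                    responses.append(pb_utils.InferenceResponse(output_tensors=[]))
                    continue

                # Otherwise run the downstream ensemble and forward its output.
                ensemble_response = pb_utils.InferenceRequest(
                    model_name="inference_retrieval_ensemble",
                    requested_output_names=["RETRIEVAL_OUTPUT"],
                    inputs=[in_tensor],
                ).exec()
                out = pb_utils.get_output_tensor_by_name(
                    ensemble_response, "RETRIEVAL_OUTPUT")
                responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
            return responses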

I can't provide a fully end-to-end reproducible example at the moment, but I have included below a simplified version of retrieval_bls's model.py and config.pbtxt in case they help.

I hope the above valgrind logs are sufficient for you to debug this issue. Thanks a lot for your help.

Edit to add: I undid the BLS change and switched back to an ensemble. Instead of using BLS branching logic to exit early, I just propagated empty data through the ensemble until the last stage. This is less efficient, as it means later stages and their queuing apply instead of exiting early, but it has solved the memory leak. See the graphs below (memory_leak_fixed), where we used to see an increasing memory usage pattern until the process was restarted by the OOM killer. After the change was rolled out, memory usage is mostly flat and there are no more OOM kills.