Closed hzlushiliang closed 1 month ago
@hzlushiliang we have completely rewritten this code path in the 0.10 release to be based on the executor API. Can you please retry in a couple of weeks when the 0.10 release is public? We would prefer not to fix an issue in the old GptManager path.
Feel free to reopen this issue if you have further questions.
Environment
- CPU architecture: x86_64
- CPU/Host memory size: 16G
- GPU properties
- Libraries
- NVIDIA driver version: 525.105.17
- OS: CentOS 8
Reproduction Steps
- I have an inference server developed directly on top of triton-core, with functionality similar to triton-server, but serving through another protocol (not gRPC, not HTTP).
- The process tried to exit normally; TRITONSERVER_ServerDelete was invoked.
Expected Behavior
- The process should exit normally and gracefully.
Actual Behavior
- A coredump happened with the stack below.
Additional Notes
As shown by the stack above, this appears to be a destruction-order problem at process exit. In detail:

- When the model is loading, `ModelInstanceState` is created, which creates a `GptManager` member instance. Control then switches into the `libtensorrt_llm_batch_manager_static` library, inside which a new thread is created to execute the function `decoupled_execution_loop`; inside that function, the `ModelInstanceState` instance is referenced to invoke `get_inference_requests`.
- When the process tries to exit, the member `mWorkItemsQueue` of `ModelInstanceState` is destructed before the member `mBatchManager`, while the child thread is still referencing `mWorkItemsQueue`, leading to a coredump.

I manually modified the source code of tensorrtllm_backend; the coredump stack above is no longer seen, but a new coredump stack shows up.
The new coredump seems to happen inside MPI resources. This again points to the same general problem of destruction misordering at process exit. Can anybody look into this? Can tensorrtllm_backend exit gracefully? Thanks, all.