Closed Desmond819 closed 1 month ago
@merrymercy I have run into multiple instances of this as well. For me it happens most often when I press Ctrl+C to cancel during the model loading stage, apparently on some critical path, but I can never reproduce it reliably. Zombie processes are left behind as a result, and I have to find them manually and kill -9 them.
Given the multiprocess design, maybe a ping/pong health check between the main/controller process and the torch.mp worker processes would solve this for good? All spawned processes would then exit automatically if no ping/pong has been received within some N seconds. What do you think?
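To make the idea concrete, here is a minimal sketch of the parent-to-worker half of such a health check, using only the Python standard library. It is not sglang code: `start_watchdog`, `send_pings`, and `worker_main` are hypothetical names, and the timeout and interval values are assumptions picked for illustration.

```python
import multiprocessing as mp
import os
import threading
import time


def start_watchdog(conn, timeout_s=5.0, on_timeout=None):
    """Run on_timeout if no ping arrives on conn within timeout_s seconds."""
    if on_timeout is None:
        # Hard exit by default: skips cleanup handlers that might hang.
        on_timeout = lambda: os._exit(1)

    def watch():
        while True:
            try:
                if not conn.poll(timeout_s):
                    break  # no ping arrived in time
                conn.recv()  # consume the ping and keep waiting
            except (EOFError, OSError):
                break  # the other end of the pipe was closed
        on_timeout()

    thread = threading.Thread(target=watch, daemon=True)
    thread.start()
    return thread


def send_pings(conn, interval_s, stop_event):
    """Controller side: send a ping every interval_s until stop_event is set."""
    while not stop_event.wait(interval_s):
        try:
            conn.send("ping")
        except OSError:
            return  # the worker side is gone


def worker_main(conn, timeout_s=5.0):
    """Hypothetical worker entry point: start the watchdog, then do the real work."""
    start_watchdog(conn, timeout_s)
    while True:  # stand-in for model loading and serving
        time.sleep(0.1)
```

The controller would create the pipe with `mp.Pipe()`, pass one end to `mp.Process(target=worker_main, args=(child_conn,))`, and run `send_pings` on the other end in a background thread. If the controller is killed, the pings stop (or the pipe closes) and each worker exits on its own. A hard `os._exit` is used because a worker stuck in model loading may not respond to a normal shutdown.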
@Qubitium This has always been an issue for us. We tried https://github.com/sgl-project/sglang/blob/459abad2615e09f4e1bd28313b60fb1ada12c432/python/sglang/srt/managers/controller/manager_single.py#L155-L157 but it does not work well.
If you have any good methods to fix this, please send a PR!
Hi @Desmond819 @Qubitium This issue has been resolved thanks to @ispobock's fix (https://github.com/sgl-project/sglang/pull/666). Could you please try the latest release? I'll close this issue for now and will reopen it if you have any further questions. Thanks.
Checklist
Describe the bug
I use pm2 to run the server, and the Python process appears to keep running after the pm2 process is killed; the GPUs remain occupied. How do I properly terminate the process?
Reproduction
pm2 start /usr/bin/python --name sglang-launch-server -- -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
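One possible workaround on Linux/macOS, independent of pm2 and of the fix in sglang itself, is to start the server in its own process group and signal the whole group on shutdown, so that spawned worker processes are terminated together with the launcher. The sketch below is an assumption-laden illustration: `launch_in_new_group` and `terminate_group` are hypothetical helper names, and the 10-second grace period is arbitrary.

```python
import os
import signal
import subprocess
import time


def launch_in_new_group(cmd):
    """Start cmd as the leader of a new session (and so a new process group)."""
    return subprocess.Popen(cmd, start_new_session=True)


def terminate_group(proc, grace_s=10.0):
    """Send SIGTERM to the whole group, then SIGKILL whatever survives grace_s."""
    pgid = proc.pid  # with start_new_session=True the child leads its own group
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return  # nothing left in the group
    deadline = time.monotonic() + grace_s
    while time.monotonic() < deadline:
        proc.poll()  # reap the leader if it has already exited
        try:
            os.killpg(pgid, 0)  # signal 0 only checks that the group exists
        except ProcessLookupError:
            return  # every process in the group is gone
        time.sleep(0.1)
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    proc.wait()
```

For example, a small wrapper script could call `launch_in_new_group(["/usr/bin/python", "-m", "sglang.launch_server", "--model-path", "meta-llama/Meta-Llama-3-8B-Instruct", "--port", "30000"])` and invoke `terminate_group` from its own SIGINT/SIGTERM handler. This only reaches processes that stay in the launcher's process group; anything that moves itself to a new session would be missed.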
Environment