[Bug] Process not terminated after PM2 is killed

sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.

Apache License 2.0


[Bug] Process not terminated after PM2 is killed #680

Closed Desmond819 closed 1 month ago

Desmond819 commented 1 month ago

Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.

Describe the bug

I use pm2 to run the server, and the Python process appears to keep running after pm2 is killed: the GPUs are still occupied. How do I properly terminate the process?

Reproduction

pm2 start /usr/bin/python --name sglang-launch-server -- -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000

Environment

N/A

merrymercy commented 1 month ago

Try this: https://github.com/sgl-project/sglang/blob/main/test/killall_sglang.sh
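
For anyone who would rather avoid a separate cleanup script, here is a minimal sketch of one alternative: start the server in its own process group and signal the whole group on shutdown. This is not the contents of the linked script and not something SGLang or pm2 does for you. `launch_in_own_group` and `terminate_group` are hypothetical helper names, the approach is POSIX-only, and it assumes the child processes stay in the group they were started in (a child that calls `setsid` itself would escape).

```python
import os
import signal
import subprocess
import sys


def launch_in_own_group(cmd):
    """Start cmd as the leader of a new session and process group (POSIX only)."""
    return subprocess.Popen(cmd, start_new_session=True)


def terminate_group(proc, grace_s=10.0):
    """Send SIGTERM to the whole group, then SIGKILL to whatever is left."""
    # With start_new_session=True the child leads its own group, so its
    # process group id equals its pid.
    pgid = proc.pid
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(pgid, sig)
        except ProcessLookupError:
            break  # nothing left in the group
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            continue  # leader ignored the signal; escalate
    return proc.poll()


if __name__ == "__main__":
    # Stand-in for the real server command from the reproduction step.
    server = launch_in_own_group(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print("exit status:", terminate_group(server, grace_s=5.0))
```

The second pass with SIGKILL runs even when the leader exits cleanly, so that workers which outlived it are swept up as well.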

Qubitium commented 1 month ago

@merrymercy I have run into multiple instances of this as well. For me it happens most often when I Ctrl+C to cancel during the model loading stage, presumably on some critical path, but I have never been able to reproduce it reliably. Zombie processes are left behind as a result, and I have to find them manually and kill -9 them.

Given the multiprocess design, maybe a ping/pong health check between the main/controller process and the torch.mp processes would solve this for good? All spawned processes would then exit automatically if no ping/pong has been received within some N seconds. What do you think?
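
A minimal sketch of that ping/pong idea, using only the standard library. It is not SGLang code: `wait_for_pings` and `start_watchdog` are hypothetical names, and the sketch assumes the controller holds the sending end of a `multiprocessing.Pipe` and writes to it periodically, while each worker holds the receiving end.

```python
import multiprocessing as mp
import os
import threading


def wait_for_pings(conn, timeout_s):
    """Block until the controller goes silent; return why the loop ended."""
    while True:
        try:
            # poll() returns False if nothing arrives within timeout_s.
            if not conn.poll(timeout_s):
                return "timeout"
            conn.recv()  # consume the ping and keep waiting
        except (EOFError, OSError):
            # The sending end was closed, e.g. because the controller died.
            return "parent_closed"


def start_watchdog(conn, timeout_s):
    """Worker side: exit the whole process once pings stop arriving."""
    def _run():
        wait_for_pings(conn, timeout_s)
        os._exit(1)  # hard exit, skipping cleanup that might hang

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    recv_end, send_end = mp.Pipe(duplex=False)
    send_end.send("ping")
    print(wait_for_pings(recv_end, timeout_s=0.1))
```

A worker would call `start_watchdog(recv_end, N)` once at startup. The watchdog runs in a thread so that it can fire even while the main thread is blocked, for example during model loading.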

merrymercy commented 1 month ago

@Qubitium This has always been an issue for us. We tried https://github.com/sgl-project/sglang/blob/459abad2615e09f4e1bd28313b60fb1ada12c432/python/sglang/srt/managers/controller/manager_single.py#L155-L157 but it does not work well.

If you have any good methods to fix this, please send a PR!

zhyncs commented 1 month ago

Hi @Desmond819 @Qubitium This issue has been resolved thanks to @ispobock's fix (https://github.com/sgl-project/sglang/pull/666). Could you please try the latest release? I'll close this issue for now and will reopen it if you have any further questions. Thanks.