mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Engine restarted with idle process #2695

Closed. Desmond819 closed this issue 2 weeks ago

Desmond819 commented 1 month ago

I use pm2 to run the mlc-llm server. After it has been running for about 2 days, I start to get this error and the server restarts. After the restart, an idle Python process keeps occupying 100% of the GPU, which makes inference very slow. Is there any way to resolve this?

[image]

tqchen commented 1 month ago

Did pm2 kill the original engine? The cancelled error in the engine happens when requests get cancelled (e.g. you send a chat completion request but do not iterate over the whole response, so the server side decides to cancel the request), but the original engine should continue functioning.
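
To illustrate the cancellation behaviour described above, here is a toy sketch in plain `asyncio`. It is not MLC-LLM code: `ToyEngine`, `stream`, and the counters are hypothetical names invented for the illustration. It only shows the general pattern, where a client that stops reading a streamed response causes that one request to be cancelled while the engine object keeps serving later requests.

```python
import asyncio


class ToyEngine:
    """Hypothetical stand-in for a streaming engine (not the MLC-LLM API)."""

    def __init__(self):
        self.cancelled = 0
        self.completed = 0

    async def stream(self, n_tokens):
        try:
            for i in range(n_tokens):
                await asyncio.sleep(0)  # pretend to decode one token
                yield f"tok{i}"
            self.completed += 1
        except GeneratorExit:
            # Raised at the suspended yield when the consumer closes the stream early.
            self.cancelled += 1
            raise


async def main():
    engine = ToyEngine()

    # Client 1 reads a single chunk and then abandons the response.
    gen = engine.stream(5)
    async for _ in gen:
        break
    await gen.aclose()  # the request is cancelled, the engine is not

    # Client 2 reads the whole response from the same engine.
    full = [tok async for tok in engine.stream(3)]
    return engine, full


if __name__ == "__main__":
    engine, full = asyncio.run(main())
    print(engine.cancelled, engine.completed, full)
```

In this sketch the cancelled request only increments a counter; the second request still completes, which mirrors the expectation that a cancelled error by itself should not take the engine down.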

If you can find a way to reproduce the error, we would be happy to dig deeper.

Desmond819 commented 1 month ago

I run on 2 GPUs, so there are literally 2 Python processes running. After the engine restarted, pm2 killed only 1 of them, which overloads the GPU. I can't reproduce the error myself because I get completion requests from the blockchain, but it happens occasionally after the server has been running for some time. Here is what it looks like when the issue happens: PID 2161 is the idle process, which holds a large amount of memory and does not release it.

[image]

Desmond819 commented 1 month ago

I think that when memory usage gets very high, errors start to occur and pm2 restarts the process, but without killing one of the processes (the one created for tensor parallelism).

tqchen commented 1 month ago

It would be nice if you could confirm that there is a local command that reproduces the issue. On the latest mlc engine, I tried starting the engine with tp=2 and manually killing the main engine process, and the extra process was released as well. Not sure if it is related to how pm2 killed the process.
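
For anyone who wants to rule out the supervisor as the cause, one option is to make sure the whole process group is signalled instead of only the main process. The following is a minimal, POSIX-only sketch under that assumption. It is not taken from MLC-LLM or pm2; `start_in_new_group` and `stop_group` are hypothetical helper names, and the command in the demo is a stand-in for the real server command.

```python
import os
import signal
import subprocess
import sys


def start_in_new_group(cmd, **popen_kwargs):
    """Start cmd as the leader of a new session, so its process group id equals its pid."""
    return subprocess.Popen(cmd, start_new_session=True, **popen_kwargs)


def stop_group(proc, grace_seconds=5.0):
    """SIGTERM the whole group, wait for the leader, then SIGKILL any member left."""
    pgid = proc.pid  # valid only because the process was started in its own session
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return  # nothing left in the group
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        pass
    # Sweep: workers that ignored SIGTERM (or a stuck leader) get SIGKILL.
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    proc.wait()


if __name__ == "__main__":
    # Stand-in command; replace with the actual server launch command.
    demo = start_in_new_group([sys.executable, "-c", "import time; time.sleep(60)"])
    stop_group(demo)
    print("leader exit code:", demo.returncode)
```

Child processes inherit the leader's process group unless they start their own, so a signal sent with `os.killpg` also reaches worker processes spawned by the leader. If a worker is still alive after a restart even with this approach, the problem is more likely in the engine than in how the supervisor kills it.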