Closed tswangdi closed 1 year ago
Hi @tswangdi - what launcher are you using? To make sure I understand your ask correctly, you're seeing a scenario where the SIGTERM isn't being handled but it is in the code you pointed out?
Thanks for the reply.
> what launcher are you using?
PDSH_LAUNCHER
I am running DeepSpeed in a Kubernetes Pod. When I delete the pod, the kubelet sends SIGTERM to deepspeed. But the deepspeed runner only catches SIGINT, and in that case sends SIGINT/SIGTERM to its subprocesses. The runner should catch more signals (at least SIGTERM; SIGSTOP cannot be caught by any process) and stop its subprocesses gracefully.
Currently, when the runner receives a SIGTERM, it shuts down, but the subprocesses keep running.
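To illustrate the requested behavior, here is a minimal sketch of a launcher that forwards both SIGINT and SIGTERM to its child and force-kills it if it does not exit in time. This is not DeepSpeed's runner code: the function name `run_with_signal_forwarding` and the `grace_seconds` parameter are hypothetical, and it starts a single local child with `subprocess.Popen` instead of going through PDSH. It assumes a POSIX system and that it is called from the main thread.

```python
import signal
import subprocess
import sys
import time


def run_with_signal_forwarding(cmd, grace_seconds=5.0):
    """Run cmd as a child process and relay SIGINT/SIGTERM to it.

    Hypothetical helper for illustration only. Returns the child's exit code
    (negative signal number if the child was terminated by a signal).
    """
    child = subprocess.Popen(cmd)
    deadline = None  # set when the first signal has been forwarded

    def forward(signum, frame):
        nonlocal deadline
        if child.poll() is None:
            # Relay the same signal so the child can shut down gracefully.
            child.send_signal(signum)
            if deadline is None:
                deadline = time.monotonic() + grace_seconds

    # Install the handler for both signals, remembering the old handlers.
    previous = {
        sig: signal.signal(sig, forward)
        for sig in (signal.SIGINT, signal.SIGTERM)
    }
    try:
        while child.poll() is None:
            # Escalate to SIGKILL if the child ignores the forwarded signal.
            if deadline is not None and time.monotonic() > deadline:
                child.kill()
            time.sleep(0.05)
        return child.returncode
    finally:
        # Restore whatever handlers were installed before.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

A wrapper script could call it as `sys.exit(run_with_signal_forwarding([sys.executable, "train.py"]))`, where `train.py` is a placeholder. In Kubernetes, the kubelet sends SIGKILL once the pod's termination grace period expires, so `grace_seconds` would need to be shorter than that period for the escalation to be useful.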
@tswangdi - can you test the linked PR (if you're able to build a version of DeepSpeed from source) to confirm this fixes the issue you're seeing? I don't have a Kubernetes pod to test on unfortunately.
Thank you for your work. 👍
The deepspeed command catches SIGINT and stops the subprocess (code).
In Kubernetes, the kubelet sends SIGTERM to the container's process before stopping the container, and deepspeed does not handle that signal. It would be helpful if deepspeed handled SIGTERM and forwarded it to the training processes.