microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] Handle SIGTERM #4098

Closed tswangdi closed 1 year ago

tswangdi commented 1 year ago

The deepspeed command catches SIGINT and stops its subprocesses (code).

In Kubernetes, kubelet sends SIGTERM to the container's process before shutting the container down, and deepspeed does not handle that signal. It would be helpful if deepspeed handled SIGTERM and forwarded it to the training processes.

loadams commented 1 year ago

Hi @tswangdi - what launcher are you using? To make sure I understand your ask correctly, you're seeing a scenario where the SIGTERM isn't being handled but it is in the code you pointed out?

tswangdi commented 1 year ago

Thanks for the reply.

To answer "what launcher are you using?": PDSH_LAUNCHER.

I am running deepspeed in a Kubernetes Pod. When I delete the pod, kubelet sends SIGTERM to deepspeed. But the deepspeed runner only catches SIGINT, and then sends SIGINT/SIGTERM to the subprocesses. The runner should catch more signals (SIGTERM in particular; SIGSTOP, like SIGKILL, cannot be caught) and stop the subprocesses gracefully.

Currently, when the runner receives SIGTERM, it shuts down, but the subprocesses keep running.
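To illustrate the behavior being requested, here is a minimal sketch of a parent process that catches both SIGINT and SIGTERM and relays them to its child. This is not DeepSpeed's runner code: `run_and_forward_signals` and `grace_seconds` are hypothetical names made up for this example, and it assumes a POSIX system and a single child process.

```python
import signal
import subprocess
import time


def run_and_forward_signals(cmd, grace_seconds=10.0):
    """Run cmd as a child process and relay SIGINT/SIGTERM to it.

    Hypothetical helper for illustration only; not part of DeepSpeed.
    Returns the child's exit code (negative if it was killed by a signal).
    """
    child = subprocess.Popen(cmd)
    deadline = None  # set once a signal has been forwarded

    def forward(signum, frame):
        nonlocal deadline
        # Relay the signal so the child gets a chance to shut down cleanly.
        if child.poll() is None:
            child.send_signal(signum)
            deadline = time.monotonic() + grace_seconds

    # SIGKILL and SIGSTOP cannot be caught, so only catchable signals are registered.
    handled = (signal.SIGINT, signal.SIGTERM)
    previous = {sig: signal.signal(sig, forward) for sig in handled}
    try:
        while True:
            try:
                return child.wait(timeout=0.2)
            except subprocess.TimeoutExpired:
                # Assumption: escalate to SIGKILL if the child ignores the signal.
                if deadline is not None and time.monotonic() > deadline:
                    child.kill()
    finally:
        # Restore whatever handlers were installed before this call.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

With something like this in the parent, deleting the pod would deliver SIGTERM to the parent, which passes it on instead of exiting and leaving the child behind. Kubernetes itself follows SIGTERM with SIGKILL once the pod's termination grace period expires, so `grace_seconds` would need to be shorter than that period for the escalation step to matter.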

loadams commented 1 year ago

@tswangdi - can you test the linked PR (if you're able to build a version of DeepSpeed from source) to confirm this fixes the issue you're seeing? I don't have a Kubernetes pod to test on unfortunately.

tswangdi commented 1 year ago

Thank you for your work. 👍