Right now we checkpoint for rescaling by installing a SIGINT/SIGTERM handler and catching the SIGTERM that Kubernetes sends when the AdaptDL scheduler decides to terminate the worker pods. However, if the training process is not running as PID 1 inside the container, it may never receive the SIGTERM, and checkpointing will not occur.
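For reference, the handler amounts to something like the sketch below. This is illustrative only, not AdaptDL's actual implementation, and save_checkpoint is a hypothetical placeholder for the real checkpointing logic:

import signal
import sys

def save_checkpoint():
    # Hypothetical placeholder: persist model/optimizer state so the
    # replacement worker pods can resume training after rescaling.
    pass

def _handle_signal(signum, frame):
    # Runs when Kubernetes delivers SIGTERM to the process (or on SIGINT).
    save_checkpoint()
    sys.exit(0)

signal.signal(signal.SIGTERM, _handle_signal)
signal.signal(signal.SIGINT, _handle_signal)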
This means that the AdaptDL training script must be the main command run in the container (i.e., not wrapped in a shell command).
Won't work:
/bin/sh -c "python3 adaptdl_training_code.py"
Will work:
python3 adaptdl_training_code.py
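If a shell wrapper cannot be avoided, using exec so that the Python process replaces the shell (and therefore runs as PID 1 and receives the SIGTERM directly) should also work:
/bin/sh -c "exec python3 adaptdl_training_code.py"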
See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination