Running the AdaptDL training process as something other than Process 1 causes checkpointing to fail.

Currently, we checkpoint for rescaling by installing a SIGINT/SIGTERM handler, which catches the SIGTERM that Kubernetes sends when the AdaptDL scheduler decides to terminate the worker pods. However, if the training process is not running as PID 1 in the container, it may never receive the SIGTERM, and no checkpoint is taken.
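
For readers unfamiliar with the pattern, here is a minimal sketch of signal-triggered checkpointing in plain Python. It is not AdaptDL's actual implementation: the names `install_handlers`, `train`, and `save_checkpoint` are hypothetical and used only for illustration. The handler just sets a flag, and the training loop checks that flag between steps.

```python
import signal

# Flag set by the handler; the training loop polls it between steps.
_exit_requested = False


def _request_exit(signum, frame):
    """Record that a termination signal arrived; do no heavy work here."""
    global _exit_requested
    _exit_requested = True


def install_handlers():
    """Route SIGINT and SIGTERM to the flag-setting handler (main thread only)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, _request_exit)


def train(num_steps, save_checkpoint):
    """Run up to num_steps; checkpoint and stop early once a signal was seen.

    save_checkpoint is a hypothetical callback that receives the step index.
    """
    for step in range(num_steps):
        # ... one training step would go here ...
        if _exit_requested:
            save_checkpoint(step)
            return step
    return num_steps
```

If the SIGTERM never reaches this process, the flag is never set and `save_checkpoint` is never called, which is the failure described above.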

This means that the AdaptDL training process must be the main command run in the container (i.e., not wrapped in a shell command).

Won't work: /bin/sh -c "python3 adaptdl_training_code.py"

Will work: python3 adaptdl_training_code.py
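
As a possible safeguard, a training script could check its own PID at startup and warn when it is not 1. The sketch below is only an illustration: `warn_if_not_pid1` is a hypothetical helper, not part of AdaptDL, and the check is only meaningful inside a container that does not share its process namespace with other containers.

```python
import os
import warnings


def warn_if_not_pid1(pid=None):
    """Return True if the given (or current) PID is 1; otherwise warn and return False."""
    pid = os.getpid() if pid is None else pid
    if pid == 1:
        return True
    warnings.warn(
        f"Training process is running as PID {pid}, not PID 1; "
        "it may not receive SIGTERM, so checkpointing may not occur."
    )
    return False
```

Calling `warn_if_not_pid1()` near the top of the training script would surface the problem in the pod logs instead of letting it show up later as a missing checkpoint.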

See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
