mlbench / mlbench-old

!!! DEPRECATED !!! distributed machine learning benchmark - a public benchmark of distributed ML solvers and frameworks
Apache License 2.0

Kubernetes reschedules master pod which causes lost data #68

Open Panaetius opened 6 years ago

Panaetius commented 6 years ago

Kubernetes can reschedule the master pod arbitrarily, mostly due to resource constraints.

In this case, all active and past jobs in the Redis Queue used for job management are lost.

See https://estl.tech/deploying-redis-with-persistence-on-google-kubernetes-engine-c1d60f70a043 for more details

PostgreSQL persistence is also needed in a similar fashion.

Panaetius commented 5 years ago

Results of initial analysis of the issue.

All of these points rest on the following assumptions about the current setup (accurate as of this writing):

Currently, the dashboard starts a background Redis job that uses kubectl exec to call mpirun and start a training run; stdout/stderr is piped by kubectl over a SPDY connection that has to stay open for the whole execution. If the dashboard goes down, that connection is lost and there is no way to recover the stdout/stderr of the run. Furthermore, the Redis job itself is lost, since it is actively running for the duration of the experiment, and there is no simple way to resume it where it left off, even if the SPDY connection could be re-established.
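
For reference, the fragile part of the current design looks roughly like the sketch below (a minimal sketch using the Kubernetes Python client; the pod name, namespace, and mpirun command are placeholders, and the actual dashboard code may differ):

```python
# Illustrative sketch of the current pattern: a long-lived `kubectl exec`-style
# SPDY connection that streams mpirun output and dies with the dashboard pod.
from kubernetes import client, config
from kubernetes.stream import stream

config.load_incluster_config()
core_v1 = client.CoreV1Api()

# Pod name, namespace and command are placeholders for illustration only.
resp = stream(
    core_v1.connect_get_namespaced_pod_exec,
    name="mlbench-worker-0",
    namespace="default",
    command=["mpirun", "--hostfile", "/etc/hostfile", "python", "main.py"],
    stderr=True, stdin=False, stdout=True, tty=False,
    _preload_content=False,
)

# The connection must stay open for the whole run; if the dashboard pod is
# rescheduled, the stream (and all buffered output) is gone.
while resp.is_open():
    resp.update(timeout=1)
    if resp.peek_stdout():
        print(resp.read_stdout())
    if resp.peek_stderr():
        print(resp.read_stderr())
```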

This problem also has a wider scope: fault tolerance in general. Workers can't write metrics while the dashboard is down, and workers themselves can fail, which is all but guaranteed for long-running experiments.

A solution to one of these issues should go hand in hand with an overall solution for fault tolerance, since that is what makes long-running experiments possible. Otherwise mlbench would be limited to experiments that take on the order of minutes, not hours, which would severely reduce its usefulness.

After some analysis, a workable solution with current K8s capabilities (and one similar to the approach adopted by the kubeflow mpi-operator, see https://github.com/kubeflow/community/blob/master/proposals/mpi-operator-proposal.md) would consist of the following components:

Dashboard/Master: The dashboard persists its PostgreSQL and Redis state to persistent storage (possibly the same volumes the workers use, to limit the amount of resources needed). The current long-running background task holding the SPDY kubectl exec connection is replaced with a periodic polling background task that reads stdout/stderr directly via kubectl logs of the K8s Job.
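
A minimal sketch of what the polling task could look like, assuming the Kubernetes Python client; `save_to_database` and the namespace are placeholders (K8s labels a Job's pods with `job-name=<jobname>`):

```python
# Sketch of the proposed periodic polling task: instead of holding a
# connection open, re-read the K8s Job's logs on a timer.
from kubernetes import client, config

config.load_incluster_config()
core_v1 = client.CoreV1Api()

def poll_job_logs(job_name, namespace="default", since_seconds=60):
    """Fetch recent stdout/stderr of the Job's pod(s) and persist it."""
    pods = core_v1.list_namespaced_pod(
        namespace, label_selector=f"job-name={job_name}"
    )
    for pod in pods.items:
        log_chunk = core_v1.read_namespaced_pod_log(
            name=pod.metadata.name,
            namespace=namespace,
            since_seconds=since_seconds,
        )
        save_to_database(job_name, log_chunk)  # hypothetical persistence helper
```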

K8s Job: The background task in the dashboard is replaced by a new K8s Job resource. Job resources allow setting restartPolicy: OnFailure, so K8s rescheduling can be mitigated to some extent, alleviating part of the problem. This Job runs kubectl exec with mpirun on the workers and writes the MPI output to its own logs (which can be retrieved with kubectl logs <jobname>). It consists of a mini-image containing kubectl and a minimal bash script that, in order: checks that all workers are ready, terminates all running mpirun processes on all workers, and starts a new mpirun. The script also passes along whether this is the first execution of the Job or whether it failed before and training should be resumed from a checkpoint.
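
The entrypoint logic could look roughly like the sketch below. It is written in Python here for readability (the actual mini-image would likely ship the bash script described above), and the worker names, NUM_WORKERS variable, /checkpoints marker file, and training command are all placeholders:

```python
# Sketch of the proposed Job entrypoint: wait for workers, kill stale mpirun
# processes, then (re)start training, resuming from a checkpoint on a retry.
import os
import subprocess
import time

WORKERS = [f"mlbench-worker-{i}" for i in range(int(os.environ["NUM_WORKERS"]))]

def kubectl(*args):
    return subprocess.run(["kubectl", *args], check=True,
                          capture_output=True, text=True).stdout

def all_workers_running():
    # Simplified readiness check: wait until every worker pod reports Running.
    for w in WORKERS:
        phase = kubectl("get", "pod", w, "-o", "jsonpath={.status.phase}")
        if phase.strip() != "Running":
            return False
    return True

# 1. Wait until all workers are up.
while not all_workers_running():
    time.sleep(5)

# 2. Terminate any mpirun left over from a previous (failed) attempt.
for w in WORKERS:
    subprocess.run(["kubectl", "exec", w, "--", "pkill", "-f", "mpirun"])

# 3. Start a fresh mpirun on the first worker; resume from a checkpoint if this
#    Job has failed before (signalled here via a marker file on shared storage,
#    which is only a placeholder mechanism).
resume = os.path.exists("/checkpoints/RESUME")
cmd = ["kubectl", "exec", WORKERS[0], "--", "mpirun", "python", "/app/main.py"]
if resume:
    cmd.append("--resume")
open("/checkpoints/RESUME", "w").close()  # mark that a run has been attempted
subprocess.run(cmd, check=True)
```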

Worker StatefulSet: Works mostly as before, but we have to make sure that mpirun fails on all workers if a single one fails (the default in Open MPI), that regular checkpoints are written (which we should do anyway), and that metrics are not lost while the dashboard is unavailable (cache entries locally and submit them once it's available again?).
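
One possible shape for the worker-side metric cache is sketched below; the dashboard endpoint, spool file path, and metric format are placeholders:

```python
# Sketch of worker-side metric reporting that survives dashboard downtime:
# try to POST, otherwise spool to local disk and flush the backlog on the
# next successful attempt.
import json
import os
import requests

DASHBOARD_URL = "http://mlbench-master/api/metrics/"   # placeholder URL
SPOOL_FILE = "/checkpoints/metrics_backlog.jsonl"      # placeholder path

def report_metric(metric: dict):
    backlog = []
    if os.path.exists(SPOOL_FILE):
        with open(SPOOL_FILE) as f:
            backlog = [json.loads(line) for line in f if line.strip()]
    backlog.append(metric)

    remaining = []
    for entry in backlog:
        try:
            requests.post(DASHBOARD_URL, json=entry, timeout=5).raise_for_status()
        except requests.RequestException:
            remaining.append(entry)  # dashboard unavailable, keep for later

    with open(SPOOL_FILE, "w") as f:
        for entry in remaining:
            f.write(json.dumps(entry) + "\n")
```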

Normal execution would look like this:

  1. Dashboard starts the K8s Job, saves its name to the database, and creates a new periodic background task that fetches the Job's logs and saves them (see the sketch after this list).
  2. The Job starts and waits for all workers to become available/ready
  3. The Job terminates any already running mpirun processes on all workers
  4. The Job starts mpirun on the first worker, which starts the experiment
  5. The workers periodically checkpoint and write metrics back to the dashboard. The Job saves stdout/stderr from the kubectl exec connection to its logs, keeping the connection open.
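
A sketch of step 1, assuming the Kubernetes Python client; the image name, backoff limit, and the `save_job_name`/`schedule_log_polling` helpers are placeholders:

```python
# Sketch: the dashboard creates the K8s Job, records its name, and schedules
# the log-polling task from the Dashboard/Master section above.
from kubernetes import client, config

config.load_incluster_config()
batch_v1 = client.BatchV1Api()

def start_experiment_job(run_id, namespace="default"):
    job_name = f"mlbench-run-{run_id}"
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=job_name),
        spec=client.V1JobSpec(
            backoff_limit=10,  # placeholder retry budget
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="OnFailure",
                    containers=[client.V1Container(
                        name="launcher",
                        image="mlbench/kubectl-launcher",  # hypothetical mini-image
                        command=["python", "/launch.py"],
                    )],
                )
            ),
        ),
    )
    batch_v1.create_namespaced_job(namespace, job)
    save_job_name(run_id, job_name)    # hypothetical DB helper
    schedule_log_polling(job_name)     # hypothetical periodic task
    return job_name
```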

On Dashboard Failure/Eviction:

The dashboard is restarted by K8s on a different node. No data is lost, since it is persisted to a Persistent Volume. Workers submit any metrics they could not write in the meantime, and the background task runs again periodically, grabbing the Job's stdout/stderr. Training continues as normal.

On Job Failure: K8s restarts the Job; the script waits for all workers again, kills the current mpirun, and restarts training from the last checkpoint (since it knows it failed before), thereby establishing a new SPDY connection for stdout/stderr. The dashboard can continue getting logs from the Job as if nothing had happened. Metrics after the last checkpoint are discarded. The dashboard is notified of the failure, so it can disregard time lost due to this process.
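
The notification mechanism isn't pinned down here; one option (polling the Job status rather than an active notification) would be for the existing polling task to compare the Job's failure count against the last value it saw, roughly as sketched below with hypothetical helper names:

```python
# Sketch: detect Job restarts on the dashboard side by reading the Job status,
# so the lost time can be flagged and discounted for the affected run.
from kubernetes import client, config

config.load_incluster_config()
batch_v1 = client.BatchV1Api()

def check_for_restarts(job_name, namespace="default"):
    status = batch_v1.read_namespaced_job_status(job_name, namespace).status
    failed_attempts = status.failed or 0
    if failed_attempts > get_last_seen_failures(job_name):  # hypothetical DB read
        mark_restart(job_name, failed_attempts)              # hypothetical: flag the run
```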

On Worker Failure/Eviction: A worker failure automatically means an Open MPI failure, which in turn means a Job failure. K8s will restart the necessary worker pods, the Job will also be restarted by K8s, and all of the Job Failure steps mentioned above apply.

In no case is relevant data lost: training continues as if nothing had happened, and this approach is robust to failure of any of the components. Metrics themselves should not be affected by restarting from a checkpoint.

Implementing this entails several changes that should be done in their own tickets (since there are few dependencies between them and they can be worked on in parallel):