Panaetius opened 6 years ago
Results of initial analysis of the issue.
All these points rest on the current (as of this writing) behavior of K8s `Job` resources and `kubectl exec`.
Currently, the dashboard starts a background Redis job that uses `kubectl exec` to call `mpirun` to start a training run, and stdout/stderr is piped by `kubectl` over a SPDY connection that is kept open during execution. If the dashboard goes down, this connection is lost and there is no way to recover the stdout/stderr of the execution. Furthermore, the job itself in Redis is lost, since it is actively running while the experiment runs, and there is no simple way to restart it at the point where it was, even if the SPDY connection could be recovered.
But this problem has a wider scope: fault tolerance in general. Workers can't write metrics while the dashboard is down, and workers themselves can fail, which is all but guaranteed for long-running experiments.
A solution to one of these issues should go hand in hand with an overall fault-tolerance solution, since that is what permits long-running experiments. Otherwise mlbench would be limited to experiments that take on the order of minutes, not hours, which would severely reduce its usefulness.
After some analysis, a workable solution with current K8s capabilities (similar to the approach adopted by the kubeflow mpi-operator, see https://github.com/kubeflow/community/blob/master/proposals/mpi-operator-proposal.md ) would consist of the following components:
Dashboard/Master: The dashboard persists its Postgresql and Redis state to persistent storage (possibly the same volumes the workers use, to limit the amount of resources needed). The current long-running background task holding the SPDY `kubectl exec` connection is replaced with a periodic polling background task that reads stdout/stderr directly through `kubectl logs` of the K8s `Job`.
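Such a polling task might look roughly like this. This is only a sketch, not actual mlbench code: the function and argument names are made up, and the command runner is injectable so the logic can be exercised without a cluster.

```python
import subprocess

def fetch_job_logs(job_name, since=None, runner=subprocess.run):
    """Fetch stdout/stderr of a K8s Job via `kubectl logs` (illustrative).

    `runner` defaults to shelling out to kubectl, but can be replaced
    with a stub for testing without a cluster.
    """
    cmd = ["kubectl", "logs", f"job/{job_name}"]
    if since is not None:
        # Only fetch lines newer than the last poll to avoid duplicates.
        cmd += ["--since-time", since.isoformat()]
    result = runner(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

The dashboard's periodic task would call this with the timestamp of its last successful poll and append the returned lines to its stored logs.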
K8s Job: The background task in the dashboard is replaced by a new K8s `Job` resource. `Job` resources allow setting `restartPolicy: OnFailure`, so K8s rescheduling can be prevented to some extent, alleviating some of the problem. This `Job` runs `kubectl exec` with `mpirun` on the workers and writes the MPI output to its local logs (which can be retrieved with `kubectl logs <jobname>`). It consists of a mini-image that contains `kubectl` and a minimal bash script that, in order: checks that all workers are ready, terminates all running `mpirun` processes on all workers, and starts a new `mpirun` run. The script also passes along whether this is the first execution of the Job or whether it failed before and training should be resumed from a checkpoint.
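The script's control flow might look roughly like the following. It is sketched in Python with an injectable command runner purely for illustration; the proposal calls for a bash script, and the entrypoint and flags below are made-up placeholders.

```python
import subprocess

def run_experiment(workers, first_run, run=subprocess.check_call):
    """Control flow of the launcher Job (illustrative sketch only).

    `run` executes one command; it is injectable so the flow can be
    exercised without a cluster.
    """
    # 1. Check that all workers are ready.
    for pod in workers:
        run(["kubectl", "wait", "--for=condition=ready", f"pod/{pod}"])
    # 2. Terminate any mpirun left over from a previous, failed execution.
    for pod in workers:
        try:
            run(["kubectl", "exec", pod, "--", "pkill", "mpirun"])
        except subprocess.CalledProcessError:
            pass  # no mpirun was running on this worker
    # 3. Start the run on the first worker; pass a resume flag if this
    #    Job failed before, so training restarts from a checkpoint.
    cmd = ["kubectl", "exec", workers[0], "--",
           "mpirun", "python", "main.py"]  # hypothetical entrypoint
    if not first_run:
        cmd.append("--resume-from-checkpoint")  # hypothetical flag
    run(cmd)
```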
Worker StatefulSet: Works mostly as before, but we have to make sure that all workers' `mpirun` processes fail if one fails (the default in OpenMPI), that regular checkpoints are written (which we should do anyway), and that metrics are not lost while the dashboard is unavailable (cache entries locally and submit them once it's available again?)
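The "cache and submit later" idea for metrics could be sketched like this. The class and the injected `submit` callable (e.g. an HTTP POST to the dashboard API) are hypothetical, not existing mlbench code.

```python
import collections

class MetricReporter:
    """Cache metrics locally and flush them when the dashboard is reachable."""

    def __init__(self, submit):
        self._submit = submit  # hypothetical callable, e.g. an HTTP POST
        self._pending = collections.deque()

    def report(self, metric):
        self._pending.append(metric)
        self.flush()

    def flush(self):
        # Submit oldest-first; stop at the first failure and keep the
        # rest cached so nothing is lost while the dashboard is down.
        while self._pending:
            metric = self._pending[0]
            try:
                self._submit(metric)
            except ConnectionError:
                return False
            self._pending.popleft()
        return True
```

On each successful submission the entry is dropped from the cache; a failed submission leaves everything queued for the next attempt.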
Normal execution would look like this:

1. The dashboard creates a new `Job` and saves its name to the database, and creates a new periodic background task to get the logs of the job and save them.
2. The `Job` starts and waits for all workers to become available/ready.
3. The `Job` terminates any already running `mpirun` processes on all workers.
4. The `Job` starts `mpirun` on the first worker, which starts the experiment.
5. The `Job` saves stdout/stderr from the `kubectl exec` connection to its logs, keeping the connection open.

On Dashboard Failure/Eviction:
The dashboard is started by K8s on a different node. Data survives because it is saved to a persistent volume. Workers submit any metrics they could not write in the meantime, and the periodically executed background task picks up the `Job` stdout/stderr again. No data is lost and training continues as normal.
On `Job` Failure:
K8s restarts the job; the script again waits for all workers, kills any current `mpirun`, and restarts training from the last checkpoint (since it knows it failed before), thereby establishing a new SPDY connection for stdout/stderr. The dashboard can continue getting logs from the `Job` as if nothing happened. Metrics recorded after the last checkpoint are discarded. The dashboard is notified of the failure, so it can disregard the time lost to this process.
On Worker Failure/Eviction:
Worker failure automatically means OpenMPI failure, which in turn means `Job` failure. K8s will restart the necessary worker pods, the `Job` will also be restarted by K8s, and all the `Job` failure steps mentioned above apply.
In no case is relevant data lost; training continues as if nothing happened, and this approach is robust to the failure of any of the components. The metrics themselves should not be affected by restarting from a checkpoint.
Implementation of this entails several changes that should be done in their own tickets (since there are few dependencies between them and they can be worked on in parallel):
- Add the K8s `Job` (needs a docker image containing `kubectl` and the bash script described above): mlbench/mlbench-dashboard/issues/2
- Use the `Job` instead of running `kubectl exec` directly, and add a background task for polling the `Job` logs: mlbench/mlbench-dashboard/issues/1
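For reference, a minimal manifest for such a launcher `Job` might look like the following sketch. The image name, script path, and service account are placeholders, not actual mlbench artifacts.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mlbench-mpirun-launcher     # placeholder name
spec:
  backoffLimit: 6                   # how often K8s retries the launcher
  template:
    spec:
      serviceAccountName: mlbench   # needs RBAC rights for `kubectl exec`
      restartPolicy: OnFailure      # restart the launcher pod in place
      containers:
      - name: launcher
        image: mlbench/kubectl-launcher:latest  # hypothetical mini-image
        command: ["/bin/bash", "/scripts/run-mpi.sh"]
```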
Kubernetes can reschedule the master pod arbitrarily, mostly due to resource constraints.
In this case, all active and past jobs are lost in the Redis queue used for job management.
See https://estl.tech/deploying-redis-with-persistence-on-google-kubernetes-engine-c1d60f70a043 for more details.
Postgresql persistence is also needed in a similar fashion.
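Following the approach in the linked article, persistent Redis could look roughly like this (all names are placeholders; the PVC backs Redis' append-only file so the job queue survives pod rescheduling):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mlbench-redis-data          # placeholder name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlbench-redis               # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels: {app: mlbench-redis}
  template:
    metadata:
      labels: {app: mlbench-redis}
    spec:
      containers:
      - name: redis
        image: redis:5
        args: ["--appendonly", "yes"]  # persist the queue to disk
        volumeMounts:
        - name: data
          mountPath: /data             # Redis' default data directory
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: mlbench-redis-data
```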