This repository has been deprecated and will be archived soon (Nov 30th, 2021). Please consider using normal Jobs for non-distributed cases and kubeflow/mpi-controller for distributed cases.
Experimental repo notice: This repository is experimental and currently only serves as a proof of concept for running distributed training with Chainer/ChainerMN on Kubernetes.
ChainerJob provides a Kubernetes custom resource that makes it easy to run distributed or non-distributed Chainer jobs on Kubernetes.
Using a Custom Resource Definition (CRD) gives users the ability to create and manage Chainer Jobs just like built-in K8s resources. For example, to create a job:
$ kubectl create -f examples/chainerjob.yaml
chainerjob.kubeflow.org "example-job" created
To list chainer jobs:
$ kubectl get chainerjobs
NAME AGE
example-job 12s
kubectl create -f deploy/
This will create:
- ChainerJob Custom Resource Definition (CRD)
- chainer-operator namespace
- ServiceAccount
- ClusterRole (see 2-rbac.yaml for detailed authorized operations)
- ClusterRoleBinding
- Deployment for the chainer-operator

Once the ChainerJob CRD is defined and the operator is up, you create a job by defining a ChainerJob custom resource.
kubectl create -f examples/chainerjob-mn.yaml
In this case the job spec looks like this:
apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
spec:
  backend: mpi
  master:
    template:
      spec:
        containers:
        - name: chainer
          image: everpeace/chainermn:1.3.0
          command:
          - sh
          - -c
          - |
            mpiexec -n 3 -N 1 --allow-run-as-root --display-map --mca mpi_cuda_support 0 \
            python3 /train_mnist.py -e 2 -b 1000 -u 100
  workerSets:
    ws0:
      replicas: 2
      template:
        spec:
          containers:
          - name: chainer
            image: everpeace/chainermn:1.3.0
            command:
            - sh
            - -c
            - |
              while true; do sleep 1 & wait; done
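The `-n 3` passed to mpiexec in this spec lines up with one master pod plus the two ws0 replicas, with `-N 1` placing one process per node. A minimal Python sketch of that arithmetic, assuming the spec has been loaded into a plain dict. The helper name `total_mpi_processes` and its `procs_per_pod` parameter are illustrative and not part of the operator:

```python
def total_mpi_processes(spec, procs_per_pod=1):
    """Return the process count to pass to `mpiexec -n`.

    Assumes one master pod plus `replicas` pods per worker set, each
    running `procs_per_pod` processes (the `-N` value in the example).
    Every worker set must state `replicas` explicitly in this sketch.
    """
    worker_sets = spec.get("workerSets") or {}
    worker_pods = sum(ws["replicas"] for ws in worker_sets.values())
    return (1 + worker_pods) * procs_per_pod


# The relevant part of examples/chainerjob-mn.yaml, written as a dict.
example_spec = {
    "backend": "mpi",
    "master": {},
    "workerSets": {"ws0": {"replicas": 2}},
}

print(total_mpi_processes(example_spec))
```

If you change `replicas` or add another worker set, the `-n` value in the master command needs to change with it.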
ChainerJob consists of Master/Workers.
- ChainerJob must have only one master.
- master is a pod (job technically) to boot your entire distributed job.
- master will be restarted automatically when it fails. You can customize retry behavior with activeDeadlineSeconds/backoffLimit. Please see examples/chainerjob-reference.yaml for details.
- ChainerJob can have 0 or more WorkerSets.
- backend defines the mechanism used to initiate process groups and exchange tensor data among the processes, for example mpi.
- With backend: mpi, hostfile and the required configurations for master and workerSets will be generated automatically.
- The slots= clause in hostfile is configurable. Please see examples/chainerjob-reference.yaml for details.
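For orientation, an Open MPI hostfile lists one host per line with an optional `slots=` count. The sketch below renders such a file from a list of host names. It is an illustration only: the file the operator actually generates is not shown in this README, the host names are placeholders modeled on the pod names in the kubectl output further down, and `render_hostfile` is a hypothetical helper.

```python
def render_hostfile(hosts, slots=1):
    """Render Open MPI hostfile text: one `<host> slots=<n>` line per host."""
    if slots < 1:
        raise ValueError("slots must be a positive integer")
    return "".join(f"{host} slots={slots}\n" for host in hosts)


# Placeholder host names; the real entries are produced by the operator.
hosts = [
    "example-job-mn-master-jm9qw",
    "example-job-mn-workerset-ws0-0",
    "example-job-mn-workerset-ws0-1",
]
print(render_hostfile(hosts), end="")
```

With `slots=1`, mpiexec places at most one process on each host by default, which matches the `Num slots: 1` entries in the job map printed by `--display-map`.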
Kubernetes supports scheduling GPUs (instructions on GKE).
Once you have a GPU-equipped cluster, you can attach the nvidia.com/gpu resource to your ChainerJob definition like this:
apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
spec:
  backend: mpi
  master:
    template:
      spec:
        containers:
        - name: chainer
          image: everpeace/chainermn:1.3.0
          resources:
            limits:
              nvidia.com/gpu: 1
...
Follow Chainer's instructions for using GPUs in Chainer.
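If you generate ChainerJob manifests programmatically, the same GPU limit can be added to a pod template held as a plain dict. This is a minimal sketch under that assumption; `with_gpu_limit` is a hypothetical helper, not something the operator provides.

```python
import copy


def with_gpu_limit(pod_template, container_name="chainer", gpus=1):
    """Return a copy of the pod template with an nvidia.com/gpu limit set
    on the named container. The input template is left unmodified."""
    updated = copy.deepcopy(pod_template)
    for container in updated["spec"]["containers"]:
        if container["name"] == container_name:
            limits = container.setdefault("resources", {}).setdefault("limits", {})
            limits["nvidia.com/gpu"] = gpus
            return updated
    raise KeyError(f"no container named {container_name!r}")


# Pod template matching the master section of the example above.
master_template = {
    "spec": {
        "containers": [
            {"name": "chainer", "image": "everpeace/chainermn:1.3.0"},
        ]
    }
}

gpu_template = with_gpu_limit(master_template, gpus=1)
print(gpu_template["spec"]["containers"][0]["resources"])
```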
To get the status of your ChainerJob:
$ kubectl get chainerjobs $JOB_NAME -o yaml
apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
...
status:
  completionTime: 2018-06-13T02:13:47Z
  conditions:
  - lastProbeTime: 2018-06-13T02:13:47Z
    lastTransitionTime: 2018-06-13T02:13:47Z
    status: "True"
    type: Complete
  startTime: 2018-06-13T02:04:47Z
  succeeded: 1
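The status block can also be checked from a script. The sketch below assumes the `status` section has been loaded into a dict with timestamps kept as strings (some YAML parsers convert them to datetime objects instead); `is_complete` and `elapsed_seconds` are illustrative helper names.

```python
from datetime import datetime, timezone


def parse_time(value):
    """Parse a UTC timestamp such as 2018-06-13T02:13:47Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_complete(status):
    """True if a condition of type Complete has status "True"."""
    return any(
        cond.get("type") == "Complete" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )


def elapsed_seconds(status):
    """Seconds between startTime and completionTime, or None if either is missing."""
    if "startTime" not in status or "completionTime" not in status:
        return None
    delta = parse_time(status["completionTime"]) - parse_time(status["startTime"])
    return delta.total_seconds()


# Values copied from the status output above.
status = {
    "completionTime": "2018-06-13T02:13:47Z",
    "conditions": [
        {
            "lastProbeTime": "2018-06-13T02:13:47Z",
            "lastTransitionTime": "2018-06-13T02:13:47Z",
            "status": "True",
            "type": "Complete",
        }
    ],
    "startTime": "2018-06-13T02:04:47Z",
    "succeeded": 1,
}

print(is_complete(status), elapsed_seconds(status))  # True 540.0
```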
You can also list all the pods belonging to a ChainerJob by using the label chainerjob.kubeflow.org/name.
$ kubectl get all -l chainerjob.kubeflow.org/name=example-job-mn
NAME                                 READY   STATUS    RESTARTS   AGE
pod/example-job-mn-master-jm9qw      1/1     Running   0          1m
pod/example-job-mn-workerset-ws0-0   1/1     Running   0          1m
pod/example-job-mn-workerset-ws0-1   1/1     Running   0          1m

NAME                                            DESIRED   CURRENT   AGE
statefulset.apps/example-job-mn-workerset-ws0   2         2         1m

NAME                              DESIRED   SUCCESSFUL   AGE
job.batch/example-job-mn-master   1         0            1m
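When scripting around this output, it can help to build the label selector and to tell master pods from worker pods by name. The sketch below infers the naming pattern from the sample output above only; it is not a documented naming contract, and `label_selector` and `pod_role` are hypothetical helpers.

```python
def label_selector(job_name):
    """Build the selector string passed to `kubectl get ... -l`."""
    return f"chainerjob.kubeflow.org/name={job_name}"


def pod_role(job_name, pod_name):
    """Classify a pod as "master" or "workerset:<set name>".

    Assumes names of the form <job>-master-<suffix> and
    <job>-workerset-<set>-<index>, as seen in the sample output.
    """
    prefix = job_name + "-"
    if not pod_name.startswith(prefix):
        raise ValueError(f"{pod_name!r} does not belong to job {job_name!r}")
    rest = pod_name[len(prefix):]
    if rest.startswith("master-"):
        return "master"
    if rest.startswith("workerset-"):
        # Drop the trailing "-<index>" to recover the worker set name.
        set_name = rest[len("workerset-"):].rsplit("-", 1)[0]
        return f"workerset:{set_name}"
    raise ValueError(f"unrecognized pod name {pod_name!r}")


print(label_selector("example-job-mn"))
for name in [
    "example-job-mn-master-jm9qw",
    "example-job-mn-workerset-ws0-0",
    "example-job-mn-workerset-ws0-1",
]:
    print(name, "->", pod_role("example-job-mn", name))
```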
Once you have the names of the pods that belong to a ChainerJob, you can inspect logs in standard ways.
$ kubectl logs example-job-mn-master-jm9qw
Data for JOB [41689,1] offset 0
======================== JOB MAP ========================
Data for node: example-job-mn-master-8qvk2 Num slots: 1 Max slots: 0 Num procs: 1
Process OMPI jobid: [41689,1] App: 0 Process rank: 0 Bound: UNBOUND
Data for node: example-job-mn-workerset-ws0-0 Num slots: 1 Max slots: 0 Num procs: 1
Process OMPI jobid: [41689,1] App: 0 Process rank: 1 Bound: UNBOUND
Data for node: example-job-mn-workerset-ws0-1 Num slots: 1 Max slots: 0 Num procs: 1
Process OMPI jobid: [41689,1] App: 0 Process rank: 2 Bound: UNBOUND
=============================================================
Warning: using naive communicator because only naive supports CPU-only execution
Warning: using naive communicator because only naive supports CPU-only execution
Warning: using naive communicator because only naive supports CPU-only execution
==========================================
Num process (COMM_WORLD): 3
Using hierarchical communicator
Num unit: 100
Num Minibatch-size: 1000
Num epoch: 2
==========================================
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 1.68413 0.87129 0.5325 0.807938 10.3654
2 0.58754 0.403208 0.8483 0.884564 16.4705
...