unisound-ail / atlas


Split MXNet distributed training into separate jobs by DMLC_ROLE, so that it can be migrated to a k8s environment #3

Open xuerq opened 7 years ago

xuerq commented 7 years ago

On mxnet0.1, worker0 was allocated 4 GPUs:

#!/bin/bash
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:;
export DMLC_ROLE=worker;            # role of this process: worker, server, or scheduler
export DMLC_PS_ROOT_PORT=9092;      # port the scheduler listens on
export DMLC_PS_ROOT_URI=172.17.0.2; # address of the scheduler
export DMLC_NUM_SERVER=1;           # total number of server processes in the job
export DMLC_NUM_WORKER=2;           # total number of worker processes in the job
python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --gpus 0,1,2,3 --kv-store dist_device_sync

Partial output from worker0:

INFO:root:Epoch[3] Batch [20]   Speed: 675.00 samples/sec       Train-accuracy=0.609375
INFO:root:Epoch[3] Batch [40]   Speed: 647.27 samples/sec       Train-accuracy=0.612500
INFO:root:Epoch[3] Batch [60]   Speed: 638.34 samples/sec       Train-accuracy=0.619531
INFO:root:Epoch[3] Batch [80]   Speed: 645.63 samples/sec       Train-accuracy=0.638672
INFO:root:Epoch[3] Batch [100]  Speed: 645.55 samples/sec       Train-accuracy=0.621094
INFO:root:Epoch[3] Batch [120]  Speed: 648.92 samples/sec       Train-accuracy=0.619922
INFO:root:Epoch[3] Batch [140]  Speed: 643.61 samples/sec       Train-accuracy=0.633203
INFO:root:Epoch[3] Batch [160]  Speed: 638.31 samples/sec       Train-accuracy=0.645703
INFO:root:Epoch[3] Batch [180]  Speed: 631.67 samples/sec       Train-accuracy=0.634766
INFO:root:Epoch[3] Resetting Data Iterator
INFO:root:Epoch[3] Time cost=39.194
INFO:root:Epoch[3] Validation-accuracy=0.625586

On mxnet0.1, worker1 was allocated 2 GPUs:

#!/bin/bash
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:; 
export DMLC_ROLE=worker; 
export DMLC_PS_ROOT_PORT=9092; 
export DMLC_PS_ROOT_URI=172.17.0.2; 
export DMLC_NUM_SERVER=1; 
export DMLC_NUM_WORKER=2; 
python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --gpus 0,1 --kv-store dist_device_sync

The server job:

#!/bin/bash
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:; 
export DMLC_ROLE=server; 
export DMLC_PS_ROOT_PORT=9092; 
export DMLC_PS_ROOT_URI=172.17.0.2; 
export DMLC_NUM_SERVER=1; 
export DMLC_NUM_WORKER=2; 
python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --kv-store dist_device_sync

The scheduler job:

#!/bin/bash
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:; 
export DMLC_ROLE=scheduler; 
export DMLC_PS_ROOT_PORT=9092; 
export DMLC_PS_ROOT_URI=172.17.0.2; 
export DMLC_NUM_SERVER=1; 
export DMLC_NUM_WORKER=2; 
python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --kv-store dist_device_sync
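
All three roles launch the same train_cifar10.py: MXNet reads DMLC_ROLE at startup and turns the process into a scheduler, server, or worker accordingly. A minimal sketch of bringing the three jobs up together on one host, assuming the scripts above are saved as run_scheduler.sh, run_server.sh, and run_worker.sh (hypothetical file names):

#!/bin/bash
# Start the scheduler first so the server and worker can register with it,
# then the server, then the worker in the foreground.
sh run_scheduler.sh &
sh run_server.sh &
sh run_worker.sh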

Partial output from worker1:

INFO:root:Epoch[3] Validation-accuracy=0.628516
INFO:root:Epoch[4] Batch [20]   Speed: 682.58 samples/sec   Train-accuracy=0.658984
INFO:root:Epoch[4] Batch [40]   Speed: 646.23 samples/sec   Train-accuracy=0.685156
INFO:root:Epoch[4] Batch [60]   Speed: 636.69 samples/sec   Train-accuracy=0.665234
INFO:root:Epoch[4] Batch [80]   Speed: 640.98 samples/sec   Train-accuracy=0.669141
INFO:root:Epoch[4] Batch [100]  Speed: 645.74 samples/sec   Train-accuracy=0.671484
INFO:root:Epoch[4] Batch [120]  Speed: 649.37 samples/sec   Train-accuracy=0.676562
INFO:root:Epoch[4] Batch [140]  Speed: 645.67 samples/sec   Train-accuracy=0.691016
INFO:root:Epoch[4] Batch [160]  Speed: 646.50 samples/sec   Train-accuracy=0.692187
INFO:root:Epoch[4] Batch [180]  Speed: 620.38 samples/sec   Train-accuracy=0.687891
INFO:root:Epoch[4] Resetting Data Iterator
INFO:root:Epoch[4] Time cost=38.919
INFO:root:Epoch[4] Validation-accuracy=0.711914

nvidia-smi on the host (PID 16916 is the 4-GPU worker0, running on GPUs 0-3; PID 16947 is the 2-GPU worker1, running on GPUs 0 and 1, so the two workers share GPUs 0 and 1):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.44                 Driver Version: 367.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:04:00.0     Off |                  N/A |
| 30%   49C    P2    84W / 180W |   1336MiB /  4037MiB |     71%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 980     Off  | 0000:05:00.0     Off |                  N/A |
| 28%   47C    P2    84W / 180W |   1337MiB /  4037MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 980     Off  | 0000:08:00.0     Off |                  N/A |
| 26%   38C    P2    49W / 180W |    524MiB /  4037MiB |     28%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 980     Off  | 0000:09:00.0     Off |                  N/A |
| 26%   37C    P2    48W / 180W |    525MiB /  4037MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     16916    C   python                                         521MiB |
|    0     16947    C   python                                         811MiB |
|    1     16916    C   python                                         521MiB |
|    1     16947    C   python                                         812MiB |
|    2     16916    C   python                                         522MiB |
|    3     16916    C   python                                         523MiB |
+-----------------------------------------------------------------------------+
xuerq commented 7 years ago

Migrating the separated jobs to a Kubernetes environment, starting 2 pods:

apiVersion: v1
kind: Pod
metadata:
  name: mxnet-ssw0
spec:
  containers:
  - name: scheduler
    image: harbor.ail.unisound.com/liuqs_public/cuda-mxnet-distributed:7.5
    ports:
    - containerPort: 9092
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 2
    command: ["/bin/sh", "-c"]
    args: ["cd /root/mxnet/example/image-classification; \
            cp -r /root/xuerq/mxnet/data/* /root/mxnet/example/image-classification/data/; \
            cp -r /root/xuerq/mxnet/runMulti* /root/mxnet/example/image-classification/; \
            export DMLC_PS_ROOT_PORT=9092; \
            export DMLC_PS_ROOT_URI=10.1.24.2; \
            export DMLC_NUM_SERVER=1; \
            export DMLC_NUM_WORKER=2; \
            export DMLC_ROLE=scheduler; sh runMulti_scheduler.sh; \
            export DMLC_ROLE=server; sh runMulti_server.sh; \
            export DMLC_ROLE=worker; sh runMulti_worker.sh /root/xuerq/mxnet/worker0.log; \
            tail -f /root/xuerq/mxnet/worker0.log
           "]
    volumeMounts:
    - name: work-path
      mountPath: /root/xuerq/mxnet
      readOnly: false
    - name: nvidia-libs-volume
      mountPath: /usr/local/nvidia
      readOnly: true
    - name: nvidia-tools-volume
      mountPath: /usr/local/nvidia/bin
      readOnly: true
  restartPolicy: Never
  volumes:
  - name: work-path
    hostPath: 
      path: /root/xuerq/mxnet
  - name: nvidia-libs-volume
    hostPath: 
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.20
  - name: nvidia-tools-volume
    hostPath: 
      path: /usr/local/nvidia/bin
  nodeName: 0c-c4-7a-82-c5-b8
---
apiVersion: v1
kind: Pod
metadata:
  name: mxnet-ssw1
spec:
  containers:
  - name: scheduler
    image: harbor.ail.unisound.com/liuqs_public/cuda-mxnet-distributed:7.5
    ports:
    - containerPort: 2222
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 2
    command: ["/bin/sh", "-c"]
    args: ["cd /root/mxnet/example/image-classification; \
            cp -r /root/xuerq/mxnet/data/* /root/mxnet/example/image-classification/data/; \
            cp -r /root/xuerq/mxnet/runMulti* /root/mxnet/example/image-classification/; \
            export DMLC_PS_ROOT_PORT=9092; \
            export DMLC_PS_ROOT_URI=10.1.24.2; \
            export DMLC_NUM_SERVER=1; \
            export DMLC_NUM_WORKER=2; \
            export DMLC_ROLE=worker; sh runMulti_worker.sh /root/xuerq/mxnet/worker1.log; \
            tail -f /root/xuerq/mxnet/worker1.log
           "]
    volumeMounts:
    - name: work-path
      mountPath: /root/xuerq/mxnet
      readOnly: false
    - name: nvidia-libs-volume
      mountPath: /usr/local/nvidia
      readOnly: true
    - name: nvidia-tools-volume
      mountPath: /usr/local/nvidia/bin
      readOnly: true
  restartPolicy: Never
  volumes:
  - name: work-path
    hostPath: 
      path: /root/xuerq/mxnet
  - name: nvidia-libs-volume
    hostPath: 
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.20
  - name: nvidia-tools-volume
    hostPath: 
      path: /usr/local/nvidia/bin
  nodeName: 0c-c4-7a-82-c5-b8
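
Since tail -f on the worker log is each container's foreground process, the training output can also be followed from outside the pods (pod names as in the yaml above):

kubectl logs -f mxnet-ssw0
kubectl logs -f mxnet-ssw1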
pineking commented 7 years ago

With the current k8s, if multiple pods are created from a single yaml file, there is a known issue where those pods can be assigned to the same GPU https://github.com/kubernetes/kubernetes/pull/28216#issuecomment-268173401; not sure whether it has been fixed yet.
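
One way to verify the assignment is to run nvidia-smi inside each pod and compare the Bus-Id columns; a minimal check, assuming the pod names from the yaml above:

kubectl exec mxnet-ssw0 -- nvidia-smi
kubectl exec mxnet-ssw1 -- nvidia-smi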

xuerq commented 7 years ago

I'll try declaring them in separate yaml files next time.
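
A minimal sketch of that approach, assuming the two Pod specs above are split into the files mxnet-ssw0.yaml and mxnet-ssw1.yaml (hypothetical file names):

kubectl create -f mxnet-ssw0.yaml
kubectl create -f mxnet-ssw1.yaml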

xuerq commented 7 years ago

Updated yaml for the MXNet-on-k8s distributed performance tests. It is largely the same as the yaml above; the only change is that the launch scripts now take two extra parameters, the batch size and the GPU ids to use, which makes debugging easier.

runMulti_worker.sh for single-machine multi-GPU:

nohup python train_cifar10.py --network resnet --num-layers 110 --batch-size $1 --gpus $2 >$3 2>&1 &

runMulti_worker.sh for distributed single-machine multi-GPU and distributed multi-machine multi-GPU:

nohup python train_cifar10.py --network resnet --num-layers 110 --batch-size $1 --gpus $2 --kv-store dist_device_sync >$3 2>&1 &
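
For example, to reproduce the four-GPU worker0 run above, the script would be invoked with the batch size, the GPU ids, and the log path as its three positional arguments:

sh runMulti_worker.sh 128 0,1,2,3 /root/xuerq/mxnet/worker0.log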

runMulti_server.sh (the server and scheduler reuse the same training command: MXNet inspects DMLC_ROLE when the dist kv-store is created and serves the parameter-server role instead of running the training loop, which is why no --gpus flag is passed):

nohup python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --kv-store dist_device_sync >/root/xuerq/mxnet/server.log 2>&1 &

runMulti_scheduler.sh:

nohup python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --kv-store dist_device_sync >/root/xuerq/mxnet/scheduler.log 2>&1 &

yaml for single-machine multi-GPU and distributed single-machine multi-GPU (the yaml is identical in both cases; the only difference is runMulti_worker.sh):

apiVersion: v1
kind: Pod
metadata:
  name: mxnet-ssw0
spec:
  containers:
  - name: scheduler
    image: harbor.ail.unisound.com/liuqs_public/cuda-mxnet-distributed:7.5
    ports:
    - containerPort: 9092
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    command: ["/bin/sh", "-c"]
    args: ["cd /root/mxnet/example/image-classification; \
            cp -r /root/xuerq/mxnet/data/* /root/mxnet/example/image-classification/data/; \
            cp -r /root/xuerq/mxnet/runMulti* /root/mxnet/example/image-classification/; \
            export DMLC_PS_ROOT_PORT=9092; \
            export DMLC_PS_ROOT_URI=mxnet-ssw0-service.default.svc.cluster.local; \
            export DMLC_NUM_SERVER=1; \
            export DMLC_NUM_WORKER=1; \
            export MXNET_ENABLE_GPU_P2P=1; \
            export DMLC_ROLE=scheduler; sh runMulti_scheduler.sh; \
            export DMLC_ROLE=server; sh runMulti_server.sh; \
            export DMLC_ROLE=worker; sh runMulti_worker.sh 128 0 /root/xuerq/mxnet/worker0.log; \
            tail -f /root/xuerq/mxnet/worker0.log
           "]
    volumeMounts:
    - name: work-path
      mountPath: /root/xuerq/mxnet
      readOnly: false
    - name: nvidia-libs-volume
      mountPath: /usr/local/nvidia
      readOnly: true
    - name: nvidia-tools-volume
      mountPath: /usr/local/nvidia/bin
      readOnly: true
  restartPolicy: Never
  volumes:
  - name: work-path
    hostPath: 
      path: /root/xuerq/mxnet
  - name: nvidia-libs-volume
    hostPath: 
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.20
  - name: nvidia-tools-volume
    hostPath: 
      path: /usr/local/nvidia/bin
  nodeName: 0c-c4-7a-82-c5-bc

yaml for distributed multi-machine multi-GPU:

apiVersion: v1
kind: Pod
metadata:
  name: mxnet-ssw0
  labels:
    name: mxnet-ssw0
spec:
  containers:
  - name: scheduler
    image: harbor.ail.unisound.com/liuqs_public/cuda-mxnet-distributed:7.5
    ports:
    - containerPort: 9092
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    command: ["/bin/sh", "-c"]
    args: ["cd /root/mxnet/example/image-classification; \
            cp -r /root/xuerq/mxnet/data/* /root/mxnet/example/image-classification/data/; \
            cp -r /root/xuerq/mxnet/runMulti* /root/mxnet/example/image-classification/; \
            export DMLC_PS_ROOT_PORT=9092; \
            export DMLC_PS_ROOT_URI=mxnet-ssw0-service.default.svc.cluster.local; \
            export DMLC_NUM_SERVER=1; \
            export DMLC_NUM_WORKER=2; \
            export DMLC_ROLE=scheduler; sh runMulti_scheduler.sh; \
            export DMLC_ROLE=server; sh runMulti_server.sh; \
            export DMLC_ROLE=worker; sh runMulti_worker.sh 256 0 /root/xuerq/mxnet/worker0.log; \
            tail -f /root/xuerq/mxnet/worker0.log
           "]
    volumeMounts:
    - name: work-path
      mountPath: /root/xuerq/mxnet
      readOnly: false
    - name: nvidia-libs-volume
      mountPath: /usr/local/nvidia
      readOnly: true
    - name: nvidia-tools-volume
      mountPath: /usr/local/nvidia/bin
      readOnly: true
  restartPolicy: Never
  volumes:
  - name: work-path
    hostPath: 
      path: /root/xuerq/mxnet
  - name: nvidia-libs-volume
    hostPath: 
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.20
  - name: nvidia-tools-volume
    hostPath: 
      path: /usr/local/nvidia/bin
  nodeName: 0c-c4-7a-82-c5-bc
---
apiVersion: v1
kind: Pod
metadata:
  name: mxnet-ssw1
  labels:
    name: mxnet-ssw1
spec:
  containers:
  - name: scheduler
    image: harbor.ail.unisound.com/liuqs_public/cuda-mxnet-distributed:7.5
    ports:
    - containerPort: 2222
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    command: ["/bin/sh", "-c"]
    args: ["cd /root/mxnet/example/image-classification; \
            cp -r /root/xuerq/mxnet/data/* /root/mxnet/example/image-classification/data/; \
            cp -r /root/xuerq/mxnet/runMulti* /root/mxnet/example/image-classification/; \
            export DMLC_PS_ROOT_PORT=9092; \
            export DMLC_PS_ROOT_URI=mxnet-ssw0-service.default.svc.cluster.local; \
            export DMLC_NUM_SERVER=1; \
            export DMLC_NUM_WORKER=2; \
            export DMLC_ROLE=worker; sh runMulti_worker.sh 256 0 /root/xuerq/mxnet/worker1.log; \
            tail -f /root/xuerq/mxnet/worker1.log
           "]
    volumeMounts:
    - name: work-path
      mountPath: /root/xuerq/mxnet
      readOnly: false
    - name: nvidia-libs-volume
      mountPath: /usr/local/nvidia
      readOnly: true
    - name: nvidia-tools-volume
      mountPath: /usr/local/nvidia/bin
      readOnly: true
  restartPolicy: Never
  volumes:
  - name: work-path
    hostPath: 
      path: /root/xuerq/mxnet
  - name: nvidia-libs-volume
    hostPath: 
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.20
  - name: nvidia-tools-volume
    hostPath: 
      path: /usr/local/nvidia/bin
  nodeName: 0c-c4-7a-82-c5-b8
xuerq commented 7 years ago

Adding mxnet_pod_service.yaml:

apiVersion: v1
kind: Service
metadata:
  labels:
    name: mxnet-ssw0
    role: service
  name: mxnet-ssw0-service
spec:
  ports:
    - port: 9092
      targetPort: 9092
  selector:
    name: mxnet-ssw0
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: mxnet-ssw1
    role: service
  name: mxnet-ssw1-service
spec:
  ports:
    - port: 2222
      targetPort: 2222
  selector:
    name: mxnet-ssw1
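
Since the pods resolve DMLC_PS_ROOT_URI through these Services, the Services should be created before the pods; a minimal sketch of the creation order, assuming the pod specs are saved as mxnet_pod.yaml (a hypothetical file name):

kubectl create -f mxnet_pod_service.yaml
kubectl create -f mxnet_pod.yaml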