xuerq opened this issue 7 years ago
Migrating the previously separate jobs to the Kubernetes environment, launching 2 pods:
apiVersion: v1
kind: Pod
metadata:
  name: mxnet-ssw0
spec:
  containers:
  - name: scheduler
    image: harbor.ail.unisound.com/liuqs_public/cuda-mxnet-distributed:7.5
    ports:
    - containerPort: 9092
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 2
    command: ["/bin/sh", "-c"]
    args: ["cd /root/mxnet/example/image-classification; \
      cp -r /root/xuerq/mxnet/data/* /root/mxnet/example/image-classification/data/; \
      cp -r /root/xuerq/mxnet/runMulti* /root/mxnet/example/image-classification/; \
      export DMLC_PS_ROOT_PORT=9092; \
      export DMLC_PS_ROOT_URI=10.1.24.2; \
      export DMLC_NUM_SERVER=1; \
      export DMLC_NUM_WORKER=2; \
      export DMLC_ROLE=scheduler; sh runMulti_scheduler.sh; \
      export DMLC_ROLE=server; sh runMulti_server.sh; \
      export DMLC_ROLE=worker; sh runMulti_worker.sh /root/xuerq/mxnet/worker0.log; \
      tail -f /root/xuerq/mxnet/worker0.log
      "]
    volumeMounts:
    - name: work-path
      mountPath: /root/xuerq/mxnet
      readOnly: false
    - name: nvidia-libs-volume
      mountPath: /usr/local/nvidia
      readOnly: true
    - name: nvidia-tools-volume
      mountPath: /usr/local/nvidia/bin
      readOnly: true
  restartPolicy: Never
  volumes:
  - name: work-path
    hostPath:
      path: /root/xuerq/mxnet
  - name: nvidia-libs-volume
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.20
  - name: nvidia-tools-volume
    hostPath:
      path: /usr/local/nvidia/bin
  nodeName: 0c-c4-7a-82-c5-b8
---
apiVersion: v1
kind: Pod
metadata:
  name: mxnet-ssw1
spec:
  containers:
  - name: scheduler
    image: harbor.ail.unisound.com/liuqs_public/cuda-mxnet-distributed:7.5
    ports:
    - containerPort: 2222
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 2
    command: ["/bin/sh", "-c"]
    args: ["cd /root/mxnet/example/image-classification; \
      cp -r /root/xuerq/mxnet/data/* /root/mxnet/example/image-classification/data/; \
      cp -r /root/xuerq/mxnet/runMulti* /root/mxnet/example/image-classification/; \
      export DMLC_PS_ROOT_PORT=9092; \
      export DMLC_PS_ROOT_URI=10.1.24.2; \
      export DMLC_NUM_SERVER=1; \
      export DMLC_NUM_WORKER=2; \
      export DMLC_ROLE=worker; sh runMulti_worker.sh /root/xuerq/mxnet/worker1.log; \
      tail -f /root/xuerq/mxnet/worker1.log
      "]
    volumeMounts:
    - name: work-path
      mountPath: /root/xuerq/mxnet
      readOnly: false
    - name: nvidia-libs-volume
      mountPath: /usr/local/nvidia
      readOnly: true
    - name: nvidia-tools-volume
      mountPath: /usr/local/nvidia/bin
      readOnly: true
  restartPolicy: Never
  volumes:
  - name: work-path
    hostPath:
      path: /root/xuerq/mxnet
  - name: nvidia-libs-volume
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.20
  - name: nvidia-tools-volume
    hostPath:
      path: /usr/local/nvidia/bin
  nodeName: 0c-c4-7a-82-c5-b8
With the current k8s, creating multiple pods from a single yaml file can result in those pods being assigned to the same GPU (https://github.com/kubernetes/kubernetes/pull/28216#issuecomment-268173401); it is unclear whether this has been fixed.
Next time I will try declaring the pods in separate yaml files, as sketched below.
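A minimal sketch of that workaround, assuming the two Pod specs above are saved as mxnet-ssw0.yaml and mxnet-ssw1.yaml (filenames are hypothetical):

# Create the pods one at a time so each goes through GPU allocation separately.
kubectl create -f mxnet-ssw0.yaml
kubectl get pod mxnet-ssw0 -o wide    # wait until Running before creating the next pod
kubectl create -f mxnet-ssw1.yaml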
Updated yaml for the MXNet-on-k8s distributed performance test: largely the same as the yaml above, except that the launch scripts now take two extra parameters, batch_size and the GPU ids to use, which makes debugging easier.
runMulti_worker.sh for single-machine multi-GPU:
nohup python train_cifar10.py --network resnet --num-layers 110 --batch-size $1 --gpus $2 >$3 2>&1 &
runMulti_worker.sh for distributed single-machine multi-GPU and distributed multi-machine multi-GPU:
nohup python train_cifar10.py --network resnet --num-layers 110 --batch-size $1 --gpus $2 --kv-store dist_device_sync >$3 2>&1 &
runMulti_server.sh:
nohup python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --kv-store dist_device_sync >/root/xuerq/mxnet/server.log 2>&1 &
runMulti_scheduler.sh:
nohup python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --kv-store dist_device_sync >/root/xuerq/mxnet/scheduler.log 2>&1 &
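For reference, the worker script's positional parameters are the batch size, the GPU ids, and the log path; a typical invocation (values illustrative) looks like:

# $1 = batch size, $2 = comma-separated GPU ids, $3 = log file
sh runMulti_worker.sh 128 0,1 /root/xuerq/mxnet/worker0.log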
yaml for single-machine multi-GPU and distributed single-machine multi-GPU: the yaml is identical in both cases; the only difference is runMulti_worker.sh (see the swap sketch after the yaml below).
apiVersion: v1
kind: Pod
metadata:
  name: mxnet-ssw0
spec:
  containers:
  - name: scheduler
    image: harbor.ail.unisound.com/liuqs_public/cuda-mxnet-distributed:7.5
    ports:
    - containerPort: 9092
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    command: ["/bin/sh", "-c"]
    args: ["cd /root/mxnet/example/image-classification; \
      cp -r /root/xuerq/mxnet/data/* /root/mxnet/example/image-classification/data/; \
      cp -r /root/xuerq/mxnet/runMulti* /root/mxnet/example/image-classification/; \
      export DMLC_PS_ROOT_PORT=9092; \
      export DMLC_PS_ROOT_URI=mxnet-ssw0-service.default.svc.cluster.local; \
      export DMLC_NUM_SERVER=1; \
      export DMLC_NUM_WORKER=1; \
      export MXNET_ENABLE_GPU_P2P=1; \
      export DMLC_ROLE=scheduler; sh runMulti_scheduler.sh; \
      export DMLC_ROLE=server; sh runMulti_server.sh; \
      export DMLC_ROLE=worker; sh runMulti_worker.sh 128 0 /root/xuerq/mxnet/worker0.log; \
      tail -f /root/xuerq/mxnet/worker0.log
      "]
    volumeMounts:
    - name: work-path
      mountPath: /root/xuerq/mxnet
      readOnly: false
    - name: nvidia-libs-volume
      mountPath: /usr/local/nvidia
      readOnly: true
    - name: nvidia-tools-volume
      mountPath: /usr/local/nvidia/bin
      readOnly: true
  restartPolicy: Never
  volumes:
  - name: work-path
    hostPath:
      path: /root/xuerq/mxnet
  - name: nvidia-libs-volume
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.20
  - name: nvidia-tools-volume
    hostPath:
      path: /usr/local/nvidia/bin
  nodeName: 0c-c4-7a-82-c5-bc
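Since only runMulti_worker.sh differs between the two modes, switching modes amounts to swapping which script sits in the hostPath directory before creating the pod; a rough sketch (the two source script names are hypothetical):

# single-machine multi-GPU: worker script without --kv-store
cp runMulti_worker_local.sh /root/xuerq/mxnet/runMulti_worker.sh
# distributed single-machine multi-GPU: worker script with --kv-store dist_device_sync
cp runMulti_worker_dist.sh /root/xuerq/mxnet/runMulti_worker.sh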
yaml for distributed multi-machine multi-GPU:
apiVersion: v1
kind: Pod
metadata:
  name: mxnet-ssw0
  labels:
    name: mxnet-ssw0
spec:
  containers:
  - name: scheduler
    image: harbor.ail.unisound.com/liuqs_public/cuda-mxnet-distributed:7.5
    ports:
    - containerPort: 9092
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    command: ["/bin/sh", "-c"]
    args: ["cd /root/mxnet/example/image-classification; \
      cp -r /root/xuerq/mxnet/data/* /root/mxnet/example/image-classification/data/; \
      cp -r /root/xuerq/mxnet/runMulti* /root/mxnet/example/image-classification/; \
      export DMLC_PS_ROOT_PORT=9092; \
      export DMLC_PS_ROOT_URI=mxnet-ssw0-service.default.svc.cluster.local; \
      export DMLC_NUM_SERVER=1; \
      export DMLC_NUM_WORKER=2; \
      export DMLC_ROLE=scheduler; sh runMulti_scheduler.sh; \
      export DMLC_ROLE=server; sh runMulti_server.sh; \
      export DMLC_ROLE=worker; sh runMulti_worker.sh 256 0 /root/xuerq/mxnet/worker0.log; \
      tail -f /root/xuerq/mxnet/worker0.log
      "]
    volumeMounts:
    - name: work-path
      mountPath: /root/xuerq/mxnet
      readOnly: false
    - name: nvidia-libs-volume
      mountPath: /usr/local/nvidia
      readOnly: true
    - name: nvidia-tools-volume
      mountPath: /usr/local/nvidia/bin
      readOnly: true
  restartPolicy: Never
  volumes:
  - name: work-path
    hostPath:
      path: /root/xuerq/mxnet
  - name: nvidia-libs-volume
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.20
  - name: nvidia-tools-volume
    hostPath:
      path: /usr/local/nvidia/bin
  nodeName: 0c-c4-7a-82-c5-bc
---
apiVersion: v1
kind: Pod
metadata:
  name: mxnet-ssw1
  labels:
    name: mxnet-ssw1
spec:
  containers:
  - name: scheduler
    image: harbor.ail.unisound.com/liuqs_public/cuda-mxnet-distributed:7.5
    ports:
    - containerPort: 2222
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    command: ["/bin/sh", "-c"]
    args: ["cd /root/mxnet/example/image-classification; \
      cp -r /root/xuerq/mxnet/data/* /root/mxnet/example/image-classification/data/; \
      cp -r /root/xuerq/mxnet/runMulti* /root/mxnet/example/image-classification/; \
      export DMLC_PS_ROOT_PORT=9092; \
      export DMLC_PS_ROOT_URI=mxnet-ssw0-service.default.svc.cluster.local; \
      export DMLC_NUM_SERVER=1; \
      export DMLC_NUM_WORKER=2; \
      export DMLC_ROLE=worker; sh runMulti_worker.sh 256 0 /root/xuerq/mxnet/worker1.log; \
      tail -f /root/xuerq/mxnet/worker1.log
      "]
    volumeMounts:
    - name: work-path
      mountPath: /root/xuerq/mxnet
      readOnly: false
    - name: nvidia-libs-volume
      mountPath: /usr/local/nvidia
      readOnly: true
    - name: nvidia-tools-volume
      mountPath: /usr/local/nvidia/bin
      readOnly: true
  restartPolicy: Never
  volumes:
  - name: work-path
    hostPath:
      path: /root/xuerq/mxnet
  - name: nvidia-libs-volume
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.20
  - name: nvidia-tools-volume
    hostPath:
      path: /usr/local/nvidia/bin
  nodeName: 0c-c4-7a-82-c5-b8
Adding mxnet_pod_service.yaml:
apiVersion: v1
kind: Service
metadata:
  labels:
    name: mxnet-ssw0
    role: service
  name: mxnet-ssw0-service
spec:
  ports:
  - port: 9092
    targetPort: 9092
  selector:
    name: mxnet-ssw0
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: mxnet-ssw1
    role: service
  name: mxnet-ssw1-service
spec:
  ports:
  - port: 2222
    targetPort: 2222
  selector:
    name: mxnet-ssw1
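Since the pods export DMLC_PS_ROOT_URI as the service DNS name, the services should exist before the pods start; a quick resolution check from inside a running pod (assuming nslookup is available in the image):

kubectl create -f mxnet_pod_service.yaml
kubectl exec mxnet-ssw0 -- nslookup mxnet-ssw0-service.default.svc.cluster.local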
On mxnet 0.1, worker0 was allocated 4 GPUs.
[worker0 partial output: screenshot]
On mxnet 0.1, worker1 was allocated 2 GPUs.
[worker1 partial output: screenshot]
[nvidia-smi output: screenshot]
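Assuming nvidia-smi is on PATH inside the containers, per-pod GPU visibility can also be checked without logging into the node:

kubectl exec mxnet-ssw0 -- nvidia-smi
kubectl exec mxnet-ssw1 -- nvidia-smi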