Closed kongjibai closed 1 year ago
Please post the YAML file of the job.
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  # name: lm-hvd-job-tf-mnist
  name: lm-horovod-job
  # namespace: vc-horovod-test
  # namespace: default
  labels:
    "volcano.sh/job-type": Horovod
spec:
  minAvailable: 9
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  WORKER_HOST=`cat /home/etc-volcano/volcano/worker.host | tr "\n" ","`;
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  cd /home/vc-hvd-test;
                  mpiexec --allow-run-as-root --host ${WORKER_HOST} --mca routed=direct -np 8 python torch_mnist_lm.py;
              image: vc-hvd-test:v1.0
              name: master
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:
                - name: vc-hvd-home
                  mountPath: /home
              resources:
                requests:
                  cpu: "500m"
                  memory: "1024Mi"
                limits:
                  cpu: "500m"
                  memory: "1024Mi"
          volumes:
            - name: vc-hvd-home
              persistentVolumeClaim:
                claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: vc-hvd-test:v1.0
              name: worker
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:
                - name: vc-hvd-home
                  mountPath: /home
              resources:
                limits:
                  nvidia.com/gpu: 1
          volumes:
            - name: vc-hvd-home
              persistentVolumeClaim:
                claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
```
---
Please try delaying the master's start. You can try using dependsOn.
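A minimal sketch of what a task-level dependsOn could look like here (the task names match the job spec above; whether dependsOn interacts cleanly with the gang plugin at minAvailable: 9 is exactly what is in question in this thread):

```yaml
# Sketch only: delay the master task until the worker task is running.
spec:
  tasks:
    - replicas: 1
      name: master
      dependsOn:
        name:
          - worker      # master is created only after the worker task
      template: {}      # master pod template as in the spec above
    - replicas: 8
      name: worker
      template: {}      # worker pod template as in the spec above
```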
I have tried using dependsOn, but there is no master pod and the worker pods stay in Pending status. After I deleted the job and applied it again, the output of kubectl get pod was "No resources found in default namespace". I can only clean up Volcano and reinstall it; if I then try dependsOn and apply a job again, the master and worker pods behave the same as before. Below is the .yaml file.
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  # name: lm-hvd-job-tf-mnist
  name: lm-horovod-job
  # namespace: vc-horovod-test
  # namespace: default
  labels:
    "volcano.sh/job-type": Horovod
spec:
  minAvailable: 9
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  WORKER_HOST=`cat /home/etc-volcano/volcano/worker.host | tr "\n" ","`;
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  cd /home/vc-hvd-test;
                  mpiexec --allow-run-as-root --host ${WORKER_HOST} --mca routed=direct -np 8 python torch_mnist_lm.py;
              image: vc-hvd-test:v1.0
              name: master
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:
                - name: vc-hvd-home
                  mountPath: /home
              resources:
                requests:
                  cpu: "500m"
                  memory: "1024Mi"
                limits:
                  cpu: "500m"
                  memory: "1024Mi"
          volumes:
            - name: vc-hvd-home
              persistentVolumeClaim:
                claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
      dependsOn:
        name:
          - "worker"
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: vc-hvd-test:v1.0
              name: worker
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:
                - name: vc-hvd-home
                  mountPath: /home
              resources:
                limits:
                  nvidia.com/gpu: 1
          volumes:
            - name: vc-hvd-home
              persistentVolumeClaim:
                claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
```
---
It is speculated that this conflicts with the gang plugin: with minAvailable: 9 the scheduler waits for all 9 pods (master included) before admitting any of them, while dependsOn holds the master back until the workers run, so the job deadlocks. You can disable the gang plugin, or set minAvailable == worker.replicas.
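A sketch of the second workaround, assuming the spec from this thread (8 workers):

```yaml
# Sketch only: gang-schedule just the workers, so dependsOn can
# hold the master back without deadlocking the whole job.
spec:
  minAvailable: 8   # == worker.replicas, instead of 9
```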
@hwdef Perhaps we should think more about MPI requirements and provide more test results for the mpi plugin.
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗
When I use Volcano to start a Horovod TF job, the lm-horovod-job-master-0 pod errors out and only reaches Running status after restarting 3 times; its log shows `Permanently added 'lm-horovod-job-worker-0.lm-horovod-job,10.10.10.10' (ECDSA) to the list of known hosts`. The output info is as below. Has anyone met or solved a problem like this? I use Volcano 1.5.1, Horovod 0.24.3, TF 1.15, Open MPI 4.0.0, CUDA 10.0, and Ubuntu 18.04 in the Docker image on K8s.