Closed: Mayank12022 closed this issue 1 year ago.
Can you try adding '--prefix', '/opt/amazon/openmpi' to the mpirun command?
The change you mentioned:
command = ["mpirun", "--prefix", "/opt/amazon/openmpi", "--allow-run-as-root", ...
does not work for me; the container crashes.
(health_check_env) 88665a544570:mpi-operator mayankgp$ kubectl get pods | grep nccl-test-job-2c9b7b53b8a
nccl-test-job-2c9b7b53b8a-hc-gather-r3b0-launcher-mcp6f 0/1 RunContainerError 4 109s
nccl-test-job-2c9b7b53b8a-hc-gather-r3b0-worker-0 1/1 Running 0 109s
nccl-test-job-2c9b7b53b8a-hc-gather-r3b0-worker-1 1/1 Running 0 109s
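To dig into why the launcher container fails, a standard Kubernetes check (not specific to mpi-operator) is to look at the pod events and the logs of the crashed container, e.g.:

kubectl describe pod nccl-test-job-2c9b7b53b8a-hc-gather-r3b0-launcher-mcp6f
kubectl logs nccl-test-job-2c9b7b53b8a-hc-gather-r3b0-launcher-mcp6f --previous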
What if you run something simpler, like hostname? E.g.:
['/opt/amazon/openmpi/bin/mpirun', '--prefix', '/opt/amazon/openmpi', '--allow-run-as-root', '--tag-output', '-np', '576', 'hostname']
This is the command from the error message; have you tried that?
In the meantime, a workaround may be to set the MCA param routed=direct on the command line or in your environment.
IIUC, this disables tree routing (where a worker relays messages through a subset of other workers) and makes every worker talk directly to the driver.
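A minimal sketch of both options, assuming standard Open MPI conventions (the -np value is the one from this thread; the other flags are elided):

# Option 1: pass the MCA param on the mpirun command line
mpirun --mca routed direct --allow-run-as-root -np 576 ...

# Option 2: set it in the launcher's environment before mpirun runs
export OMPI_MCA_routed=direct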
@thoraxe I think you were able to run a large number of workers at some point? I seem to remember a parameter was needed.
Another reason might be that workers need to wait for other workers to be reachable, and for that you might need an entrypoint similar to this one: https://github.com/kubeflow/mpi-operator/blob/master/build/base/intel-entrypoint.sh (this comes from an offline discussion with @ArangoGutierrez).
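A minimal sketch of such an entrypoint, assuming the workers are listed one per line in a hostfile mounted at /etc/mpi/hostfile (both the path and the DNS-resolution check are my assumptions; the linked script is the authoritative version):

#!/bin/bash
# Wait until every worker host in the hostfile resolves in DNS,
# then exec the original launcher command.
HOSTFILE=${HOSTFILE:-/etc/mpi/hostfile}
for host in $(awk '{print $1}' "$HOSTFILE"); do
  until getent hosts "$host" >/dev/null; do
    echo "waiting for $host to become resolvable..."
    sleep 2
  done
done
exec "$@"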
A quick check to see whether the entrypoint would solve the issue is to increase the number of retries:
spec:
  runPolicy:
    backoffLimit: 100
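The idea, as I read it: if the launcher eventually succeeds after several restarts, the failure is a startup race, and waiting in the entrypoint until all workers are reachable should fix it without retries.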
You can try this patch and see if it works for you: https://github.com/kubeflow/mpi-operator/pull/465
Got it resolved by adding routed=direct.
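For reference, a sketch of how the flag could slot into the launcher command shown earlier (the exact placement of the --mca arguments is my assumption, not confirmed in the thread):

command = ["mpirun", "--mca", "routed", "direct", "--prefix", "/opt/amazon/openmpi", "--allow-run-as-root", ...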
Would you mind documenting this in the README for runs with a high number of workers?
Hi everyone, I created the MPI Operator v2 using
kubectl apply -f deploy/v2beta1/mpi-operator.yaml
I ran the NCCL tests and they run fine when I use 64 or fewer instances with 8 GPUs each (512 GPUs). The issue occurs when I run the NCCL tests on more than 64 instances, for example 72 instances (576 GPUs). Here is the error that I get.
FYI, the NCCL tests work on <=64 instances with 8 GPUs each. It was the same story with the v1 MPI operator: the NCCL tests work on <=64 instances with 8 GPUs each but fail for more than 64 instances. Here is the error on the v1 MPI operator. As you can see, the error messages in the v1 and v2 cases are similar.
Please suggest how I can run NCCL tests on more than 64 instances with 8 GPUs each. Here is the command for the NCCL test that I run.