kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

NCCL tests failures when running on more than 64 instances with 8 GPUs each using MPI operator V2 beta1 #501

Closed · Mayank12022 closed this issue 1 year ago

Mayank12022 commented 1 year ago

Hi everyone, I deployed the MPI Operator V2 with kubectl apply -f deploy/v2beta1/mpi-operator.yaml. I ran the NCCL tests and they run fine when I use 64 or fewer instances with 8 GPUs each (512 GPUs).

The issue occurs when I run the NCCL tests on more than 64 instances, for example 72 instances (576 GPUs). Here is the error that I get:

bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.
*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
Warning: Permanently added 'nccl-test-job-73b0a388018-hc-gather-r3b0-worker-62.nccl-test-job-73b0a388018-hc-gather-r3b0-worker,192.168.220.206' (ECDSA) to the list of known hosts.
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
  my node:   nccl-test-job-73b0a388018-hc-gather-r3b0-launcher
  target node:  nccl-test-job-73b0a388018-hc-gather-r3b0-worker-2.nccl-test-job-73b0a388018-hc-gather-r3b0-worker
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[nccl-test-job-73b0a388018-hc-gather-r3b0-launcher:00001] 31 more processes have sent help message help-errmgr-base.txt / no-path
[nccl-test-job-73b0a388018-hc-gather-r3b0-launcher:00001] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
bash: orted: command not found
bash: orted: command not found

FYI, the NCCL tests work on <=64 instances with 8 GPUs each.

When I was using the V1 MPI Operator it was the same story: NCCL tests work on <=64 instances with 8 GPUs each but fail for more than 64. Here is the error with the V1 MPI Operator:

+ /opt/kube/kubectl exec nccl-test-job-e39b6804078-hc-gather-r3b0-worker-40 --+ /opt/kube/kubectl exec nccl-test-job-e39b6804078-hc-gather-r3b0-worker-36 -- /bin/sh -c cat /etc/hosts_of_nodes >> /etc/hosts &&           PATH=/opt/amazon/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/amazon/openmpi/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/amazon/openmpi/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;   /opt/amazon/openmpi/bin/orted -mca ess "env" -mca ess_base_jobid "614072320" -mca ess_base_vpid 37 -mca ess_base_num_procs "111" -mca orte_hnp_uri "614072320.0;tcp://192.168.235.32:34467" --mca pml "^cm" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "614072320.0;tcp://192.168.235.32:34467" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca orte_tag_output "1" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca rmaps_base_oversubscribe "1" -mca pmix "^s1,s2,cray,isolated"
 /bin/sh -c cat /etc/hosts_of_nodes >> /etc/hosts &&           PATH=/opt/amazon/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/amazon/openmpi/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/amazon/openmpi/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;   /opt/amazon/openmpi/bin/orted -mca ess "env" -mca ess_base_jobid "614072320" -mca ess_base_vpid 41 -mca ess_base_num_procs "111" -mca orte_hnp_uri "614072320.0;tcp://192.168.235.32:34467" --mca pml "^cm" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "614072320.0;tcp://192.168.235.32:34467" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca orte_tag_output "1" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca rmaps_base_oversubscribe "1" -mca pmix "^s1,s2,cray,isolated"
+ /opt/kube/kubectl exec nccl-test-job-e39b6804078-hc-gather-r3b0-worker-58 -- /bin/sh -c cat /etc/hosts_of_nodes >> /etc/hosts &&           PATH=/opt/amazon/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/amazon/openmpi/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/amazon/openmpi/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;   /opt/amazon/openmpi/bin/orted -mca ess "env" -mca ess_base_jobid "614072320" -mca ess_base_vpid 59 -mca ess_base_num_procs "111" -mca orte_hnp_uri "614072320.0;tcp://192.168.235.32:34467" --mca pml "^cm" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "614072320.0;tcp://192.168.235.32:34467" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca orte_tag_output "1" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca rmaps_base_oversubscribe "1" -mca pmix "^s1,s2,cray,isolated"
+ /opt/kube/kubectl exec nccl-test-job-e39b6804078-hc-gather-r3b0-worker-20 -- /bin/sh -c cat /etc/hosts_of_nodes >> /etc/hosts &&           PATH=/opt/amazon/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/amazon/openmpi/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/amazon/openmpi/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;   /opt/amazon/openmpi/bin/orted -mca ess "env" -mca ess_base_jobid "614072320" -mca ess_base_vpid 21 -mca ess_base_num_procs "111" -mca orte_hnp_uri "614072320.0;tcp://192.168.235.32:34467" --mca pml "^cm" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "614072320.0;tcp://192.168.235.32:34467" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca orte_tag_output "1" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca rmaps_base_oversubscribe "1" -mca pmix "^s1,s2,cray,isolated"
+ /opt/kube/kubectl exec nccl-test-job-e39b6804078-hc-gather-r3b0-worker-60 -- /bin/sh -c cat /etc/hosts_of_nodes >> /etc/hosts &&           PATH=/opt/amazon/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/amazon/openmpi/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/amazon/openmpi/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;   /opt/amazon/openmpi/bin/orted -mca ess "env" -mca ess_base_jobid "614072320" -mca ess_base_vpid 61 -mca ess_base_num_procs "111" -mca orte_hnp_uri "614072320.0;tcp://192.168.235.32:34467" --mca pml "^cm" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "614072320.0;tcp://192.168.235.32:34467" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca orte_tag_output "1" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca rmaps_base_oversubscribe "1" -mca pmix "^s1,s2,cray,isolated"
+ /opt/kube/kubectl exec nccl-test-job-e39b6804078-hc-gather-r3b0-worker-18 -- /bin/sh -c cat /etc/hosts_of_nodes >> /etc/hosts &&           PATH=/opt/amazon/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/amazon/openmpi/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/amazon/openmpi/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;   /opt/amazon/openmpi/bin/orted -mca ess "env" -mca ess_base_jobid "614072320" -mca ess_base_vpid 19 -mca ess_base_num_procs "111" -mca orte_hnp_uri "614072320.0;tcp://192.168.235.32:34467" --mca pml "^cm" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "614072320.0;tcp://192.168.235.32:34467" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca orte_tag_output "1" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca rmaps_base_oversubscribe "1" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=nccl-test-job-e39b6804078-hc-gather-r3b0-worker-70
+ [ n = - ]
+ shift
+ /opt/kube/kubectl cp /opt/kube/hosts nccl-test-job-e39b6804078-hc-gather-r3b0-worker-70:/etc/hosts_of_nodes
/etc/mpi/kubexec.sh: 10: /opt/kube/kubectl: not found
+ /opt/kube/kubectl exec nccl-test-job-e39b6804078-hc-gather-r3b0-worker-70 -- /bin/sh -c cat /etc/hosts_of_nodes >> /etc/hosts &&  orted -mca ess "env" -mca ess_base_jobid "614072320" -mca ess_base_vpid 71 -mca ess_base_num_procs "111" -mca orte_hnp_uri "614072320.0;tcp://192.168.235.32:34467" --mca pml "^cm" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca orte_tag_output "1" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca rmaps_base_oversubscribe "1" -mca pmix "^s1,s2,cray,isolated" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "614072320.7;tcp://192.168.248.75:52711"
/etc/mpi/kubexec.sh: 11: /opt/kube/kubectl: not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
+ POD_NAME=nccl-test-job-e39b6804078-hc-gather-r3b0-worker-92
+ [ n = - ]
+ shift
+ /opt/kube/kubectl cp /opt/kube/hosts nccl-test-job-e39b6804078-hc-gather-r3b0-worker-92:/etc/hosts_of_nodes
/etc/mpi/kubexec.sh: 10: /opt/kube/kubectl: not found
+ /opt/kube/kubectl exec nccl-test-job-e39b6804078-hc-gather-r3b0-worker-92 -- /bin/sh -c cat /etc/hosts_of_nodes >> /etc/hosts &&  orted -mca ess "env" -mca ess_base_jobid "614072320" -mca ess_base_vpid 93 -mca ess_base_num_procs "111" -mca orte_hnp_uri "614072320.0;tcp://192.168.235.32:34467" --mca pml "^cm" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca orte_tag_output "1" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca rmaps_base_oversubscribe "1" -mca pmix "^s1,s2,cray,isolated" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "614072320.29;tcp://192.168.253.210:50199"
/etc/mpi/kubexec.sh: 11: /opt/kube/kubectl: not found
command terminated with exit code 1
command terminated with exit code 1
+ POD_NAME=nccl-test-job-e39b6804078-hc-gather-r3b0-worker-77
+ [ n = - ]
+ shift
+ /opt/kube/kubectl cp /opt/kube/hosts nccl-test-job-e39b6804078-hc-gather-r3b0-worker-77:/etc/hosts_of_nodes
/etc/mpi/kubexec.sh: 10: /opt/kube/kubectl: not found
+ /opt/kube/kubectl exec nccl-test-job-e39b6804078-hc-gather-r3b0-worker-77 -- /bin/sh -c cat /etc/hosts_of_nodes >> /etc/hosts &&  orted -mca ess "env" -mca ess_base_jobid "614072320" -mca ess_base_vpid 78 -mca ess_base_num_procs "111" -mca orte_hnp_uri "614072320.0;tcp://192.168.235.32:34467" --mca pml "^cm" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca orte_tag_output "1" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca rmaps_base_oversubscribe "1" -mca pmix "^s1,s2,cray,isolated" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "614072320.14;tcp://192.168.199.15:37235"
/etc/mpi/kubexec.sh: 11: /opt/kube/kubectl: not found
command terminated with exit code 1

As you can see, the error messages in the V1 and V2 cases are similar.

Please suggest how I can run NCCL tests on more than 64 instances with 8 GPUs each. Here is the command for the NCCL test that I run:

['/opt/amazon/openmpi/bin/mpirun', '--allow-run-as-root', '--tag-output', '-np', '576', '-bind-to', 'none', '-map-by', 'slot', '-x', 'PATH', '-x', 'LD_LIBRARY_PATH', '-x', 'XLA_FLAGS', '-x', 'TF_XLA_FLAGS', '-x', 'NCCL_DEBUG=INFO', '-x', 'NCCL_ALGO=RING', '-x', 'FI_EFA_USE_DEVICE_RDMA=1', '-x', 'RDMAV_FORK_SAFE=1', '--mca', 'pml', '^cm', '--oversubscribe', '/opt/nccl-tests/build/all_reduce_perf', '-b', '8', '--minbytes', '1000000000', '--maxbytes', '1000000000', '-f', '2', '-g', '1', '-g', '1', '-c', '1', '-n', '100']
wzamazon commented 1 year ago

Can you try adding '--prefix', '/opt/amazon/openmpi' to the mpirun command?
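
Roughly, the launch command would become something like this (a sketch only; '...' stands for the unchanged remainder of your original arguments):

# Sketch: only --prefix is added; everything else stays as in the original command.
/opt/amazon/openmpi/bin/mpirun --prefix /opt/amazon/openmpi --allow-run-as-root --tag-output \
    -np 576 -bind-to none -map-by slot ... /opt/nccl-tests/build/all_reduce_perf ...

--prefix makes each remote node prepend /opt/amazon/openmpi/bin to PATH (and the corresponding lib directory to LD_LIBRARY_PATH) before launching orted, which is the usual remedy for "bash: orted: command not found".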

Mayank12022 commented 1 year ago

The change you mentioned (command = ["mpirun", "--prefix", "/opt/amazon/openmpi", "--allow-run-as-root", ...) does not work for me. The launcher container crashes:

(health_check_env) 88665a544570:mpi-operator mayankgp$ kubectl get pods | grep nccl-test-job-2c9b7b53b8a
nccl-test-job-2c9b7b53b8a-hc-gather-r3b0-launcher-mcp6f    0/1     RunContainerError   4          109s
nccl-test-job-2c9b7b53b8a-hc-gather-r3b0-worker-0          1/1     Running             0          109s
nccl-test-job-2c9b7b53b8a-hc-gather-r3b0-worker-1          1/1     Running             0          109s
wzamazon commented 1 year ago

What if you run something simpler, like hostname? e.g.

['/opt/amazon/openmpi/bin/mpirun', '--prefix', '/opt/amazon/openmpi', '--allow-run-as-root', '--tag-output', '-np', '576', 'hostname']
alculquicondor commented 1 year ago

This is in the error message; have you tried it?

In the meantime, a workaround may be to set the MCA param routed=direct on the command line or in your environment.

IIUC, this disables tree routing (where a worker relays messages for a subset of other workers), so everything goes directly through the driver instead.
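
Concretely, a sketch of how routed=direct can be set, assuming the same launch command as above ('...' stands for the rest of the existing arguments):

# Option 1: pass the MCA parameter on the mpirun command line
/opt/amazon/openmpi/bin/mpirun --mca routed direct --allow-run-as-root ... /opt/nccl-tests/build/all_reduce_perf ...

# Option 2: set it in the launcher's environment instead of the command line
export OMPI_MCA_routed=direct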

@thoraxe I think you were able to run a large number of workers at some point? I seem to remember a parameter was needed.

Another reason might be that workers need to wait for the other workers to be reachable; for that you might need an entrypoint similar to this one: https://github.com/kubeflow/mpi-operator/blob/master/build/base/intel-entrypoint.sh (this comes from an offline discussion with @ArangoGutierrez). A rough sketch of the idea follows.
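
For illustration only (this is not the actual intel-entrypoint.sh): a minimal wrapper that waits until every host listed in the MPI hostfile resolves before running the real command. The hostfile path matches the one visible in the logs above.

#!/bin/sh
# Sketch: block until all workers in the hostfile are resolvable, then exec the real command.
HOSTFILE=/etc/mpi/hostfile
for host in $(awk '{print $1}' "$HOSTFILE"); do
  until getent hosts "$host" > /dev/null 2>&1; do
    echo "waiting for $host to resolve..."
    sleep 2
  done
done
exec "$@"

With something like this as the launcher entrypoint, mpirun would only start once DNS entries exist for all worker pods, instead of failing early and relying on backoff retries.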

alculquicondor commented 1 year ago

A quick check to see whether the entrypoint would solve the issue is to increase the number of retries:

spec:
  runPolicy:
    backoffLimit: 100
ArangoGutierrez commented 1 year ago

You can try this patch, https://github.com/kubeflow/mpi-operator/pull/465, and see if it works for you.

Mayank12022 commented 1 year ago

Got it resolved by adding routed=direct.

alculquicondor commented 1 year ago

Would you mind documenting this in the README for a high number of workers?