open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Cannot use SHARP with Singularity container #10045

vitduck opened this issue 2 years ago

vitduck commented 2 years ago

What version of Open MPI are you using?

v4.1.1

Describe how Open MPI was installed

 Configure command line:
 '--prefix=/apps/compiler/gcc/4.8.5/mpi/openmpi/4.1.1'
 '--enable-dlopen' '--enable-binaries'
 '--enable-mpirun-prefix-by-default'
 '--enable-mpi-fortran' '--enable-mpi-cxx'
 '--enable-mpi-cxx-seek' '--enable-oshmem'
 '--enable-oshmem-compat' '--enable-oshmem-profile'
 '--enable-oshmem-fortran' '--enable-shared'
 '--enable-static' '--enable-wrapper-rpath'
 '--enable-openib-rdmacm'
 '-enable-openib-rdmacm-ibaddr' '--with-slurm=/usr'
 '--with-pmi=/usr'
 '--with-io-romio-flags=-with-file-system=lustre'
 '--with-lustre=/usr' '--with-hwloc=/usr'
 '--with-hwloc-libdir=/usr/lib64'
 '--with-hcoll=/opt/mellanox/hcoll' '--with-pic'

Please describe the system on which you are running

I suspect that it is not possible for SHARP to pass through the container's sandbox.

jladd-mlnx commented 2 years ago

@bureddy FYI. Hi, could you please add the output from the run where SHARP works ("SHARP does work if we build the CPU version of HPL from source.")? Which ConnectX devices are you running over?

To validate that SHARP is indeed up and performing as expected, I would recommend starting out with a microbenchmark like OSU Allreduce and running it with and without SHARP enabled. HPL won't exercise SHARP.
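
For reference, a minimal sketch of that comparison, assuming an HCOLL-enabled Open MPI and the OSU micro-benchmarks are already installed (process count, hostfile, and binary path are placeholders; HCOLL_ENABLE_SHARP=0 disables SHARP inside HCOLL, while HCOLL_ENABLE_SHARP=3 forces it on):

 # Baseline: HCOLL collectives without SHARP
 mpirun -np 64 --hostfile hosts --mca coll_hcoll_enable 1 \
     -x HCOLL_ENABLE_SHARP=0 ./osu_allreduce

 # Same run with SHARP forced on; SHARP_COLL_LOG_LEVEL adds init/teardown messages
 mpirun -np 64 --hostfile hosts --mca coll_hcoll_enable 1 \
     -x HCOLL_ENABLE_SHARP=3 -x SHARP_COLL_LOG_LEVEL=3 ./osu_allreduce

A visible latency difference at small and medium message sizes, together with SHARP initialization lines in the log, indicates the offload is actually being used.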

bureddy commented 2 years ago

The container's Open MPI was built with UCX but not with HCOLL.

@vitduck SHARP is enabled through HCOLL. Can you build OMPI with HCOLL?
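
For reference, a minimal sketch of such a build, assuming HCOLL and UCX are installed under /opt/mellanox/hcoll and /opt/ucx inside the image (the HCOLL path mirrors the host build quoted above; the other paths are placeholders):

 # Rebuild the container's Open MPI against UCX and HCOLL
 ./configure --prefix=/opt/openmpi \
     --with-ucx=/opt/ucx \
     --with-hcoll=/opt/mellanox/hcoll
 make -j 8 && make install

 # Confirm the hcoll collective component is present in the new build
 ompi_info | grep -i hcoll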

vitduck commented 2 years ago

@jladd-mlnx Hi, I've attached the output of HPL with SHARP_COLL_LOG_LEVEL=4. I've tested three variants of HPL with SHARP so far:

  1. src build
  2. AMD-optimized HPL binary (https://developer.amd.com/amd-aocl/blas-library)
  3. Singularity container from NGC (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/hpc-benchmarks)

Both the source build and the binary version produce debug messages that correctly sandwich HPL's output.
For the Singularity container there are no debug messages, so I am not sure whether the SHARP collectives are actually used. Attachment: HPL-n1-g16-t4.txt
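
One quick sanity check, independent of whether the container's MPI has HCOLL, is whether the variable reaches the processes inside the container at all. A sketch, assuming the NGC image is named hpc-benchmarks.sif (the filename is a placeholder; SINGULARITYENV_ is Singularity's prefix for injecting variables into the container environment):

 # Export via mpirun and via Singularity's injection prefix,
 # then confirm the variable is visible inside the container
 export SINGULARITYENV_SHARP_COLL_LOG_LEVEL=4
 mpirun -np 1 -x SHARP_COLL_LOG_LEVEL=4 \
     singularity exec hpc-benchmarks.sif env | grep SHARP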

We are running ConnectX-6

0e:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]

As you suggested, the OSU benchmarks did show improved bandwidth when SHARP was enabled.

> HPL won't exercise SHARP.

The bottleneck of HPL is BLAS, so I agree that the effect of SHARP would be negligible. What puzzled me was the lack of debug messages.

@bureddy Unfortunately the container is provided as-is by NVIDIA. Do you mean that I should replace the container's MPI with an HCOLL-enabled one?

The host/outer MPI is responsible for orchestrating the Singularity processes: host MPI -> orted -> singularity -> container MPI -> xhpl. Regardless, I think some debug information, such as SHARP initialization and finalization messages, should be present, as in the attached log file.
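
For completeness, a sketch of that chain plus a check on the container's MPI, assuming the NGC image is hpc-benchmarks.sif and GPUs are exposed with --nv (image name, rank count, and the xhpl invocation inside the container are placeholders):

 # Host Open MPI launches one Singularity process per rank;
 # xhpl then runs against the MPI libraries inside the image
 mpirun -np 16 -x SHARP_COLL_LOG_LEVEL=4 \
     singularity exec --nv hpc-benchmarks.sif xhpl

 # If the container's Open MPI has no hcoll component, SHARP is never
 # initialized there, so no SHARP messages can appear in the output
 singularity exec hpc-benchmarks.sif ompi_info | grep -i hcoll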