Open vitduck opened 2 years ago
@bureddy FYI. Hi, can you please add the output from the run where SHARP works ("SHARP does work if we build CPU version of HPL from source.")? Which ConnectX devices are you running over?
To validate that SHARP is indeed up and performing as expected, I would recommend starting out with a microbenchmark like OSU Allreduce and running it with and without SHARP enabled. HPL won't exercise SHARP.
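As a rough sketch of that comparison, the helper below builds two `mpirun` command lines for OSU Allreduce, one with SHARP disabled and one with it requested through HCOLL. The hostfile name, process count, and benchmark path are placeholders, and the values chosen for `HCOLL_ENABLE_SHARP` and `SHARP_COLL_LOG_LEVEL` are assumptions that should be checked against the SHARP/HCOLL documentation for your installed version.

```python
import shlex


def osu_allreduce_cmd(enable_sharp, nprocs=2, hostfile="hosts",
                      binary="./osu_allreduce", log_level=3):
    """Build an mpirun command line for OSU Allreduce through HCOLL.

    hostfile, nprocs and binary are placeholders; adjust them to your cluster.
    """
    # Assumption: HCOLL_ENABLE_SHARP=0 disables SHARP and 3 requests it for
    # all communicators. Verify the accepted values for your HCOLL version.
    env = {"HCOLL_ENABLE_SHARP": "3" if enable_sharp else "0"}
    if enable_sharp:
        # Ask libsharp_coll for more verbose logging on the SHARP run.
        env["SHARP_COLL_LOG_LEVEL"] = str(log_level)

    cmd = ["mpirun", "-np", str(nprocs), "--hostfile", hostfile,
           "--mca", "coll_hcoll_enable", "1"]
    for key, value in env.items():
        # -x exports the variable to every launched rank.
        cmd += ["-x", f"{key}={value}"]
    cmd.append(binary)
    return cmd


if __name__ == "__main__":
    # Print both command lines so they can be reviewed before running.
    for flag in (False, True):
        print(shlex.join(osu_allreduce_cmd(flag)))
```

Running both commands over the same hosts and comparing the reported numbers at small message sizes should show whether SHARP is actually being used.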
The container's OpenMPI was built with UCX but not HCOLL.
@vitduck SHARP is enabled through HCOLL. Can you build OMPI with HCOLL?
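One way to confirm whether a given Open MPI build includes the HCOLL component is to look for it in the `ompi_info` component list. The snippet below is a minimal sketch that assumes the usual `MCA coll: <name> (...)` line format printed by `ompi_info`; the helper names are illustrative only.

```python
import re
import shutil
import subprocess


def coll_components(ompi_info_text):
    """Return the names of the 'coll' MCA components listed by ompi_info."""
    # Assumes lines of the form "MCA coll: hcoll (MCA v2.1.0, ...)".
    return set(re.findall(r"^\s*MCA coll:\s+(\S+)", ompi_info_text,
                          flags=re.MULTILINE))


def has_hcoll(ompi_info_text):
    """True if the hcoll collective component appears in the listing."""
    return "hcoll" in coll_components(ompi_info_text)


if __name__ == "__main__":
    exe = shutil.which("ompi_info")
    if exe is None:
        print("ompi_info not found on PATH")
    else:
        out = subprocess.run([exe], capture_output=True, text=True,
                             check=True).stdout
        print("hcoll component present:", has_hcoll(out))
```

To inspect the container's Open MPI rather than the host's, the script would need to run inside the container (for example through `singularity exec`).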
@jladd-mlnx
Hi, I have attached the output of HPL with SHARP_COLL_LOG_LEVEL=4. I've tested three variants of HPL with SHARP so far.
Both src-build and binary versions produce debug messages correctly sandwiching HPL's output.
For the singularity variant there are no debug messages, so I am not sure whether SHARP collectives are being used.
HPL-n1-g16-t4.txt
We are running ConnectX-6:
0e:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
As you suggested, the OSU benchmarks did show improved bandwidth when SHARP was enabled.
> HPL won't exercise SHARP.
The bottleneck of HPL is BLAS, so I agree that the effect of SHARP would be negligible. What puzzled me was the lack of debug messages.
@bureddy Unfortunately, the container was provided as-is by NVIDIA. Do you mean that I should replace the container's MPI with an HCOLL-enabled one?
The host/outer MPI is responsible for orchestrating singularity processes: host-MPI -> orted -> singularity -> container-MPI -> xhpl. Regardless, I think that some debug information such as SHARP initialization and finalization should be present as shown in the attached log file.
What version of Open MPI are you using?
v4.1.1
Describe how Open MPI was installed
Please describe the system on which you are running
Network type: Mellanox Infiniband
Details of the problem
I am testing the performance of SHARP with the HPL singularity container from NGC:
I suspect that it is not possible for SHARP to pass through the container's sandbox.