hpcugent / vsc-mympirun

mympirun is a tool to facilitate running MPI programs on an HPC cluster
GNU General Public License v2.0

disable libfabric (OFI) btl + mtl by default when using OpenMPI (HPC-11091) #197

Closed boegel closed 8 months ago

boegel commented 8 months ago

The motivation for this is the inconsistent errors "Failed to modify UD QP to INIT on mlx5_0: Operation not permitted" that we have been seeing since updating to OFED 23.10.

Worth noting: the same can be achieved in contexts where mympirun is not used, via:

export OMPI_MCA_btl='^uct,ofi'
export OMPI_MCA_pml='ucx'
export OMPI_MCA_mtl='^ofi'
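
For use in a job script, the same settings could be wrapped so that they only act as defaults. This is a minimal sketch and not what mympirun itself does: the function name `set_ompi_no_ofi` and the `./mpi_hello` binary are hypothetical, and it assumes a POSIX shell. The `--mca` form shown in the comment is the standard Open MPI command-line equivalent of the `OMPI_MCA_*` environment variables.

```shell
#!/bin/sh
# Hypothetical helper: apply the MCA settings above as defaults,
# keeping any value that was already set in the environment.
set_ompi_no_ofi() {
    # ${VAR:=value} assigns only when VAR is unset or empty
    : "${OMPI_MCA_btl:=^uct,ofi}"   # exclude the uct and ofi BTLs
    : "${OMPI_MCA_pml:=ucx}"        # use the UCX PML
    : "${OMPI_MCA_mtl:=^ofi}"       # exclude the ofi MTL
    export OMPI_MCA_btl OMPI_MCA_pml OMPI_MCA_mtl
}

set_ompi_no_ofi

# Show what Open MPI will pick up from the environment.
env | grep '^OMPI_MCA_' | sort

# Equivalent one-off invocation without environment variables
# (./mpi_hello is a placeholder for your own MPI program):
#   mpirun --mca btl '^uct,ofi' --mca pml ucx --mca mtl '^ofi' ./mpi_hello
```

Using defaults rather than unconditional exports means a user can still opt back in, for example by exporting their own `OMPI_MCA_mtl` before the helper runs.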
wdpypere commented 8 months ago

LGTM, very minor remarks.

boegel commented 8 months ago

Don't merge this yet please, we should actually test first that the workaround implemented here fixes the issues...

wdpypere commented 8 months ago

> Don't merge this yet please, we should actually test first that the workaround implemented here fixes the issues...

ok

boegel commented 8 months ago

@hajgato is testing this currently, seems to work as designed for me with a quick test (MPI hello on top of OpenMPI)

wdpypere commented 8 months ago

> @hajgato is testing this currently, seems to work as designed for me with a quick test (MPI hello on top of OpenMPI)

so this can be merged?

hajgato commented 8 months ago

@wdpypere Yes, the tests passed on Tier1