open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Errors when running mpi programs #12520

Open rafelamer opened 2 months ago

rafelamer commented 2 months ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

The version of Open MPI is 5.0.2.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

It was installed on Fedora 40 hosts with the command dnf install openmpi openmpi-devel

I don't know if it is relevant, but in Fedora 40 the Open MPI library is linked against libfabric.
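
One way to verify that linkage (a sketch; the library path assumes Fedora's default Open MPI layout and may differ on other systems):

shell$ ldd /usr/lib64/openmpi/lib/libmpi.so | grep fabric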

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running


Details of the problem

I cannot run an MPI program on a 3-node cluster with IP addresses 195.201.223.246, 162.55.213.49 and 88.198.157.233. When I run

shell$ mpirun -np 16 --hostfile ~/hosts ./mpi02

I get errors of the form

mce-eseiaat.com:rank0:  Trying to connect from eth0 port 1 (subnet 195.201.223.246/32) to a node (IP=88.198.157.233/32 TCP=45575 UDP=54793) on a different subnet 88.198.157.233/32
mce-eseiaat.com:rank3:  Trying to connect from eth0 port 1 (subnet 195.201.223.246/32) to a node (IP=162.55.213.49/32 TCP=37449 UDP=41668) on a different subnet 162.55.213.49/32
mce-eseiaat.com:rank6:  Trying to connect from eth0 port 1 (subnet 195.201.223.246/32) to a node (IP=88.198.157.233/32 TCP=54837 UDP=47899) on a different subnet 88.198.157.233/32
mce-eseiaat.com:rank9:  Trying to connect from eth0 port 1 (subnet 195.201.223.246/32) to a node (IP=162.55.213.49/32 TCP=45297 UDP=39015) on a different subnet 162.55.213.49/32
mce-eseiaat.com:rank12:  Trying to connect from eth0 port 1 (subnet 195.201.223.246/32) to a node (IP=88.198.157.233/32 TCP=35391 UDP=33260) on a different subnet 88.198.157.233/32
mce-eseiaat.com:rank15:  Trying to connect from eth0 port 1 (subnet 195.201.223.246/32) to a node (IP=162.55.213.49/32 TCP=51183 UDP=52527) on a different subnet 162.55.213.49/32
worker2.mce-eseiaat.com:rank2:  Trying to connect from eth0 port 1 (subnet 162.55.213.49/32) to a node (IP=195.201.223.246/32 TCP=44503 UDP=49948) on a different subnet 195.201.223.246/32
worker2.mce-eseiaat.com:rank11:  Trying to connect from eth0 port 1 (subnet 162.55.213.49/32) to a node (IP=88.198.157.233/32 TCP=51443 UDP=36464) on a different subnet 88.198.157.233/32
worker2.mce-eseiaat.com:rank14:  Trying to connect from eth0 port 1 (subnet 162.55.213.49/32) to a node (IP=195.201.223.246/32 TCP=33317 UDP=40668) on a different subnet 195.201.223.246/32
worker1.mce-eseiaat.com:rank1:  Trying to connect from eth0 port 1 (subnet 88.198.157.233/32) to a node (IP=195.201.223.246/32 TCP=32781 UDP=53663) on a different subnet 195.201.223.246/32
worker1.mce-eseiaat.com:rank7:  Trying to connect from eth0 port 1 (subnet 88.198.157.233/32) to a node (IP=195.201.223.246/32 TCP=47383 UDP=51499) on a different subnet 195.201.223.246/32
worker2.mce-eseiaat.com:rank8:  Trying to connect from eth0 port 1 (subnet 162.55.213.49/32) to a node (IP=195.201.223.246/32 TCP=40591 UDP=38371) on a different subnet 195.201.223.246/32
worker1.mce-eseiaat.com:rank13:  Trying to connect from eth0 port 1 (subnet 88.198.157.233/32) to a node (IP=195.201.223.246/32 TCP=56965 UDP=38971) on a different subnet 195.201.223.246/32
worker1.mce-eseiaat.com:rank10:  Trying to connect from eth0 port 1 (subnet 88.198.157.233/32) to a node (IP=162.55.213.49/32 TCP=59499 UDP=38225) on a different subnet 162.55.213.49/32
worker2.mce-eseiaat.com:rank5:  Trying to connect from eth0 port 1 (subnet 162.55.213.49/32) to a node (IP=88.198.157.233/32 TCP=39797 UDP=49164) on a different subnet 88.198.157.233/32
worker1.mce-eseiaat.com:rank4:  Trying to connect from eth0 port 1 (subnet 88.198.157.233/32) to a node (IP=162.55.213.49/32 TCP=60669 UDP=38137) on a different subnet 162.55.213.49/32

The contents of the hosts file are:

mce-eseiaat.com slots=8
worker1.mce-eseiaat.com slots=4
worker2.mce-eseiaat.com slots=4
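
The source of mpi02 is not shown in this thread; a minimal ring-exchange test along the following lines (a hypothetical stand-in, not the actual program) exercises the same rank-to-rank connections and triggers the same errors:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pass a token around a ring so that every rank opens a
       connection to a neighbor, possibly on another node. */
    if (size > 1) {
        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }
    }
    printf("rank %d of %d done\n", rank, size);

    MPI_Finalize();
    return 0;
}

It can be compiled with mpicc (on Fedora, typically after module load mpi/openmpi-x86_64):

shell$ mpicc -o mpi02 mpi02.c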

Best regards, Rafel Amer

wenduwan commented 2 months ago

Is mpi02 on a shared NFS volume? It would be helpful to double-check the linking:

ldd mpi02

I don't know if it is relevant, in Fedora 40 the openmpi library is linked to libfabric

We can rule out libfabric with additional MCA parameters:

mpirun -np 16 --mca pml ob1 --mca btl tcp,self --hostfile ~/hosts ./mpi02

This prevents libfabric from being used.
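
If the workaround is adopted long-term, the same parameters can also be set persistently in a per-user MCA parameter file (a sketch, using Open MPI's standard $HOME/.openmpi/mca-params.conf location):

pml = ob1
btl = tcp,self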

rafelamer commented 2 months ago

Hi,

With the command mpirun -np 16 --mca pml ob1 --mca btl tcp,self --hostfile ~/hosts ./mpi02 it works fine, so it seems that the problem is related to libfabric.

Thanks, Rafel Amer

wenduwan commented 2 months ago

Thanks for checking. Just to clarify, do you intend to use libfabric at all?

I wonder how libfabric is configured on your system - we can move the discussion to the libfabric community if you'd like. You can check the installed packages with:

$ dnf list installed | grep libfabric
$ dnf info <libfabric package name>
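
If the libfabric command-line utilities are installed, fi_info can also list the providers visible on each node (a sketch; the utility ships with libfabric):

shell$ fi_info -l
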
rafelamer commented 2 months ago

OK, I will subscribe to the Libfabric-users mailing list and then make a post.

Best regards, Rafel Amer

wenduwan commented 2 months ago

The libfabric community would need more information to investigate the issue.

As a starting point, you can turn on the relevant verbose options in mpirun:

--mca btl_ofi_verbose 1 -x FI_LOG_LEVEL=info
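
Combined with the earlier command line, a full invocation might look like this (a sketch assembled from the flags above):

shell$ mpirun -np 16 --mca btl_ofi_verbose 1 -x FI_LOG_LEVEL=info --hostfile ~/hosts ./mpi02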