open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

OpenIB wireup fails when there is an IP alias #65

Closed: ompiteam closed this issue 3 years ago

ompiteam commented 10 years ago

While testing for ticket 1505, I attempted the same setup, replacing the tcp BTL with openib, and got the following:

$ mpirun --host r1-rdma,r1-iw,r2-rdma,r2-iw --mca btl openib,sm,self /opt/ompi/openmpi-cpc2-install/tests/IMB-3.0/IMB-MPI1 pingpong
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.0, MPI-1 part  
#---------------------------------------------------
# Date                  : Tue Nov 11 10:17:14 2008
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.20.6
# Version               : #1 SMP Fri Apr 6 14:03:16 PDT 2007
# MPI Version           : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE

#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# PingPong
[r2-iw][[29416,1],2][connect/btl_openib_connect_rdmacm.c:1385:finish_connect] rdma_connect Failed with -1
[r2-iw][[29416,1],3][connect/btl_openib_connect_rdmacm.c:1385:finish_connect] rdma_connect Failed with -1
--------------------------------------------------------------------------
mpirun has exited due to process rank 3 with PID 7835 on
node r2-iw exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
ompiteam commented 10 years ago

Imported from trac issue 1665. Created by jdmason on 2008-11-11T13:31:25, last modified: 2008-12-03T09:44:10

ompiteam commented 10 years ago

Trac comment by jdmason on 2008-11-11 13:37:04:

I do not believe fixing this is necessary for 1.3.0. After adding some additional debug verbosity, it appears that the two interfaces on the same system are trying to communicate with each other. See the output below:

$ mpirun --host r1-rdma,r1-iw,r2-rdma,r2-iw --mca btl openib,sm,self /opt/ompi/openmpi-cpc2-install/tests/IMB-3.0/IMB-MPI1 pingpong
[r1-iw][[29380,1],0][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7cb520 (0x7b8ae0), I still am NOT the initiator to r2-rdma
[r1-iw][[29380,1],1][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7cbea0 (0x7b8ae0), I still am NOT the initiator to r2-iw
[r2-iw][[29380,1],2][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7c67a0 (0x7b8ae0), I still am the initiator to r1-iw
[r2-iw][[29380,1],3][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7c71c0 (0x7b8ae0), I still am the initiator to r1-iw
[r1-iw][[29380,1],0][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7dcdd0 (0x7b9190), I still am NOT the initiator to r2-rdma
[r1-iw][[29380,1],1][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7dd4c0 (0x7b9190), I still am NOT the initiator to r2-iw
[r2-iw][[29380,1],2][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7d8ce0 (0x7b9190), I still am the initiator to r1-iw
[r2-iw][[29380,1],3][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7d9470 (0x7b9190), I still am the initiator to r1-iw
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.0, MPI-1 part
#---------------------------------------------------
# Date                  : Tue Nov 11 10:12:18 2008
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.20.6
# Version               : #1 SMP Fri Apr 6 14:03:16 PDT 2007
# MPI Version           : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE

#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
[r2-iw][[29380,1],2][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7c7ba0 (0x7b8ae0), I still am NOT the initiator to r2-iw
[r2-iw][[29380,1],2][connect/btl_openib_connect_rdmacm.c:1385:finish_connect] rdma_connect Failed with -1
[r2-iw][[29380,1],3][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7c7ba0 (0x7b8ae0), I still am the initiator to r2-rdma
[r2-iw][[29380,1],3][connect/btl_openib_connect_rdmacm.c:1385:finish_connect] rdma_connect Failed with -1
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 7684 on
node r2-rdma exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

ompiteam commented 10 years ago

Trac comment by jdmason on 2008-11-11 14:06:41:

Reposting console output from my previous post

[ompi@r1-iw ompi-trunk]$ mpirun --host r1-rdma,r1-iw,r2-rdma,r2-iw --mca btl openib,sm,self /opt/ompi/openmpi-cpc2-install/tests/IMB-3.0/IMB-MPI1 pingpong
[r1-iw][[29380,1],0][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7cb520 (0x7b8ae0), I still am NOT the initiator to r2-rdma
[r1-iw][[29380,1],1][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7cbea0 (0x7b8ae0), I still am NOT the initiator to r2-iw
[r2-iw][[29380,1],2][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7c67a0 (0x7b8ae0), I still am the initiator to r1-iw
[r2-iw][[29380,1],3][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7c71c0 (0x7b8ae0), I still am the initiator to r1-iw
[r1-iw][[29380,1],0][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7dcdd0 (0x7b9190), I still am NOT the initiator to r2-rdma
[r1-iw][[29380,1],1][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7dd4c0 (0x7b9190), I still am NOT the initiator to r2-iw
[r2-iw][[29380,1],2][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7d8ce0 (0x7b9190), I still am the initiator to r1-iw
[r2-iw][[29380,1],3][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7d9470 (0x7b9190), I still am the initiator to r1-iw
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.0, MPI-1 part    
#---------------------------------------------------
# Date                  : Tue Nov 11 10:12:18 2008
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.20.6
# Version               : #1 SMP Fri Apr 6 14:03:16 PDT 2007
# MPI Version           : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE

#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# PingPong
[r2-iw][[29380,1],2][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7c7ba0 (0x7b8ae0), I still am NOT the initiator to r2-iw
[r2-iw][[29380,1],2][connect/btl_openib_connect_rdmacm.c:1385:finish_connect] rdma_connect Failed with -1
[r2-iw][[29380,1],3][connect/btl_openib_connect_rdmacm.c:1382:finish_connect] SERVICE in finish_connect; ep=0x7c7ba0 (0x7b8ae0), I still am the initiator to r2-rdma
[r2-iw][[29380,1],3][connect/btl_openib_connect_rdmacm.c:1385:finish_connect] rdma_connect Failed with -1
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 7684 on
node r2-rdma exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
ompiteam commented 10 years ago

Trac comment by jdmason on 2008-11-11 15:06:29:

More debug was added to add_rdma_addr in ompi/mca/btl/openib/btl_openib_iwarp.c, where it was noticed that multiple addresses/subnets are being added for each ib_dev. When mca_btl_openib_get_iwarp_subnet_id is called, it simply returns the first address/subnet found for the given device. Unfortunately, there seems to be no way to know which IP address/subnet is intended given only the ib_dev. More investigation is needed to determine whether there is a way to find out which one it is looking for. In the meantime, a note should be added to the release notes about the limitation of only one IP address per physical port.

ompiteam commented 10 years ago

Trac comment by jdmason on 2008-11-12 15:09:22:

A patch, soon to be applied, that prevents IP aliasing until this ticket is resolved will keep this from being an issue in Open MPI 1.3.0. However, it raised an interesting issue. Since the openib BTL is device/port based, it will add all the IP addresses pertaining to each device/port. With IP aliasing, this could be two or more addresses per port/adapter. There should be a BTL parameter to include/exclude certain IP addresses, but there is not. This should be added when the problem above is fixed (and might be a better workaround than simply excluding all devices with multiple IP addresses).

ompiteam commented 10 years ago

Trac comment by jsquyres on 2008-11-14 13:06:59:

It turns out that Chelsio needs at least part of this fix for v1.3. We came to a compromise:

  1. Supporting IP aliasing can be put off to v1.3.1 or later (see if any real-world users care). If the openib BTL detects that it has added the same OpenFabrics port more than once, we'll show_help a friendly message and abort the job.
  2. Add two new MCA params for including/excluding specific IP interfaces by means of an (a.b.c.d/e) specification. This allows running on devices that have IP aliases, but only when specifying which (single) network interface on that device to use.

Working on a patch for that right now.

ompiteam commented 10 years ago

Trac comment by jdmason on 2008-11-17 14:14:12:

While working on the compromise mentioned above, I determined the cause of the IP alias problem for iWARP. The netmask was being used improperly when determining which subnet each connection is on. With this bug corrected, the IP-aliased connections work. Patch is forthcoming.

ompiteam commented 10 years ago

Trac comment by jdmason on 2008-11-19 14:55:35:

Correction: the IP alias issue is still there. The problem still lies in mca_btl_openib_get_iwarp_subnet_id supplying the first address/subnet found for the given device. The patch I committed to trunk simply fixes the subnet to be a legitimate value.

ompiteam commented 10 years ago

Trac comment by jsquyres on 2008-11-25 10:57:32:

I believe that r20016 also needs collision detection to ensure that the user specifies only one of the btl_openib_if_include, btl_openib_if_exclude, btl_openib_ipaddr_include, and btl_openib_ipaddr_exclude MCA parameters.

ompiteam commented 10 years ago

Trac comment by jsquyres on 2008-12-02 12:40:39:

Jon and I talked about this a bunch. We have decided that OMPI's default behavior is actually good and reasonable; with the ipaddr_[in|ex]clude MCA params, users can effect whatever subnet routing they want.

So Jon's going to add the check to ensure only 1 of the 4 MCA params is specified, and we're going to leave the rest of the functionality the same. Then we'll add a FAQ item about these new parameters and how/when you might want to use them.

ompiteam commented 10 years ago

Trac comment by timattox on 2008-12-03 09:44:10:

(In [20059]) Closes #1674, Refs #1665: iWARP subnet fix

Submitted by jdmason, Reviewed by jsquyres, RM-approved by bbenton

r20016: This patch consists of two parts. Part one fixes a bug in determining the IP subnet: the netmask was being used improperly when determining which subnet each connection is on. Part two adds the ability to include/exclude specific subnets.

r20052: This commit adds comments regarding IP aliases and the default behavior when determining which IP address to use when transmitting data. It also adds logic to prevent usage of more than one of the btl_openib_if_include, btl_openib_if_exclude, btl_openib_ipaddr_include, or btl_openib_ipaddr_exclude MCA parameters.

r20053: Gracefully handle NULL strings when calling orte_show_help for preventing usage of more than one of the btl_openib_if_include, btl_openib_if_exclude, btl_openib_ipaddr_include, or btl_openib_ipaddr_exclude MCA parameters.

awlauria commented 3 years ago

Openib btl is removed. Closing.