open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Add support for 50G and 100G adapters in openib btl #3431

Open dsharma283 opened 7 years ago

dsharma283 commented 7 years ago

Background information

When Open MPI is run over a link that reports HDR speed, the test fails to start because the openib BTL does not support the HDR link speed or the 1x link width.
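To illustrate the failure mode described above, here is a minimal sketch (not Open MPI source code) of a speed/width lookup that rejects a port it does not recognise. The numeric codes are the `active_speed` and `active_width` encodings as printed by libibverbs' `ibv_devinfo`, to the best of my understanding. The names `port_gbps`, `LEGACY_SPEEDS` and `LEGACY_WIDTHS` are hypothetical, and the "legacy" sets simply mirror the description in this report (no HDR, no 1x); they have not been checked against the openib BTL source.

```python
from typing import Optional

# Encodings as shown by ibv_devinfo (Gbps per lane, number of lanes).
SPEED_GBPS = {1: 2.5, 2: 5.0, 4: 10.0, 8: 10.0, 16: 14.0, 32: 25.0, 64: 50.0}
WIDTH_LANES = {1: 1, 2: 4, 4: 8, 8: 12}

# Assumption: what a lookup lacking HDR and 1x support would accept.
LEGACY_SPEEDS = frozenset({1, 2, 4, 8, 16, 32})
LEGACY_WIDTHS = frozenset({2, 4, 8})


def port_gbps(active_speed: int, active_width: int,
              speeds=LEGACY_SPEEDS, widths=LEGACY_WIDTHS) -> Optional[float]:
    """Return the nominal link rate in Gbps, or None if the port is not recognised."""
    if active_speed not in speeds or active_width not in widths:
        return None  # the caller would treat this as a device init error
    return SPEED_GBPS[active_speed] * WIDTH_LANES[active_width]


if __name__ == "__main__":
    # A 50G port reporting HDR (code 64) at 1x (code 1):
    print(port_gbps(64, 1))
    # The same port once both codes are accepted:
    print(port_gbps(64, 1, speeds=frozenset(SPEED_GBPS), widths=frozenset(WIDTH_LANES)))
```

With the legacy sets the 50G port is rejected outright; accepting codes 64 and 1 is enough for the same lookup to return a rate.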

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

openmpi-2.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Standard distribution tarball, configured with --prefix=/usr/local/mpi/openmpi

Please describe the system on which you are running


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I am trying to run IMB using openmpi-2.0.1/2.1.0 on a 50G 2-node cluster in my lab, but the test does not start. It fails with the following error:

Starting for 0 th iteration. Using openmpi
LOGPATH: /MPI/Logs/openmpi/imb/runlog-openmpi-np6-n2-0
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   calypso-rhel73GA
  Local device: bnxt_re0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[25467,1],0]) is on host: calypso-rhel73GA
  Process 2 ([[25467,1],1]) is on host: pandora-rhel73GA
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[calypso-rhel73GA:12532] *** An error occurred in MPI_Bcast
[calypso-rhel73GA:12532] *** reported by process [140683322785793,0]
[calypso-rhel73GA:12532] *** on communicator MPI_COMM_WORLD
[calypso-rhel73GA:12532] *** MPI_ERR_INTERN: internal error
[calypso-rhel73GA:12532] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[calypso-rhel73GA:12532] ***    and potentially your MPI job)
*** Error in `/usr/local/imb/openmpi/dcheck/IMB-MPI1': free(): invalid pointer: 0x00007ff37b2f34d8 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7c503)[0x7ff37a9ac503]
/usr/local/mpi/openmpi/lib/libmpi.so.20(+0x58d17)[0x7ff37af65d17]
/usr/local/mpi/openmpi/lib/libmpi.so.20(ompi_mpi_errors_are_fatal_comm_handler+0x105)[0x7ff37af66485]
/usr/local/mpi/openmpi/lib/libmpi.so.20(ompi_errhandler_invoke+0x115)[0x7ff37af659c5]
/usr/local/mpi/openmpi/lib/libmpi.so.20(MPI_Bcast+0x1a3)[0x7ff37af86743]
/usr/local/imb/openmpi/dcheck/IMB-MPI1[0x402dd7]
/usr/local/imb/openmpi/dcheck/IMB-MPI1[0x401e0b]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff37a951b35]
/usr/local/imb/openmpi/dcheck/IMB-MPI1[0x402744]
======= Memory map: ========
00400000-00415000 r-xp 00000000 fd:00 33970734
  /usr/local/imb/openmpi/dcheck/IMB-MPI1
00614000-00615000 r--p 00014000 fd:00 33970734
  /usr/local/imb/openmpi/dcheck/IMB-MPI1
00615000-00616000 rw-p 00015000 fd:00 33970734
  /usr/local/imb/openmpi/dcheck/IMB-MPI1
00616000-0061a000 rw-p 00000000 00:00 0
00f2c000-01071000 rw-p 00000000 00:00 0                                  [heap]
7ff35ffff000-7ff368000000 rw-s 00000000 fd:00 17524899
  /tmp/openmpi-sessions-0@calypso-rhel73GA_0/25467/1/shared_mem_pool.calypso-rhel73GA
(deleted)
7ff368000000-7ff368021000 rw-p 00000000 00:00 0
7ff368021000-7ff36c000000 ---p 00000000 00:00 0
7ff36c000000-7ff36c021000 rw-p 00000000 00:00 0
7ff36c021000-7ff370000000 ---p 00000000 00:00 0
7ff370000000-7ff370021000 rw-p 00000000 00:00 0
7ff370021000-7ff374000000 ---p 00000000 00:00 0
7ff374698000-7ff37469e000 r-xp 00000000 fd:00 51165705
  /usr/local/lib/libbnxtre-rdmav2.so
7ff37469e000-7ff37489d000 ---p 00006000 fd:00 51165705
  /usr/local/lib/libbnxtre-rdmav2.so
7ff37489d000-7ff37489e000 r--p 00005000 fd:00 51165705
  /usr/local/lib/libbnxtre-rdmav2.so
7ff37489e000-7ff37489f000 rw-p 00006000 fd:00 51165705
  /usr/local/lib/libbnxtre-rdmav2.so
7ff37489f000-7ff3748a4000 r-xp 00000000 fd:00 252310528
  /usr/lib64/libibverbs/libcxgb3-rdmav2.so
7ff3748a4000-7ff374aa3000 ---p 00005000 fd:00 252310528
  /usr/lib64/libibverbs/libcxgb3-rdmav2.so
7ff374aa3000-7ff374aa4000 r--p 00004000 fd:00 252310528
  /usr/lib64/libibverbs/libcxgb3-rdmav2.so
7ff374aa4000-7ff374aa5000 rw-p 00005000 fd:00 252310528
  /usr/lib64/libibverbs/libcxgb3-rdmav2.so
7ff374aa5000-7ff374aac000 r-xp 00000000 fd:00 252310529
  /usr/lib64/libibverbs/libcxgb4-rdmav2.so
7ff374aac000-7ff374cab000 ---p 00007000 fd:00 252310529
  /usr/lib64/libibverbs/libcxgb4-rdmav2.so
7ff374cab000-7ff374cac000 r--p 00006000 fd:00 252310529
  /usr/lib64/libibverbs/libcxgb4-rdmav2.so
7ff374cac000-7ff374cad000 rw-p 00007000 fd:00 252310529
  /usr/lib64/libibverbs/libcxgb4-rdmav2.so
7ff374cad000-7ff374cb1000 r-xp 00000000 fd:00 252310530
  /usr/lib64/libibverbs/libhfi1verbs-rdmav2.so
7ff374cb1000-7ff374eb0000 ---p 00004000 fd:00 252310530
  /usr/lib64/libibverbs/libhfi1verbs-rdmav2.so
7ff374eb0000-7ff374eb1000 r--p 00003000 fd:00 252310530
  /usr/lib64/libibverbs/libhfi1verbs-rdmav2.so
7ff374eb1000-7ff374eb2000 rw-p 00004000 fd:00 252310530
  /usr/lib64/libibverbs/libhfi1verbs-rdmav2.so
7ff374eb2000-7ff374eb7000 r-xp 00000000 fd:00 252310531
  /usr/lib64/libibverbs/libhns-rdmav2.so
7ff374eb7000-7ff3750b6000 ---p 00005000 fd:00 252310531
  /usr/lib64/libibverbs/libhns-rdmav2.so
7ff3750b6000-7ff3750b7000 r--p 00004000 fd:00 252310531
  /usr/lib64/libibverbs/libhns-rdmav2.so
7ff3750b7000-7ff3750b8000 rw-p 00005000 fd:00 252310531
  /usr/lib64/libibverbs/libhns-rdmav2.so
7ff3750b8000-7ff3750be000 r-xp 00000000 fd:00 252310532
  /usr/lib64/libibverbs/libi40iw-rdmav2.so
7ff3750be000-7ff3752be000 ---p 00006000 fd:00 252310532
  /usr/lib64/libibverbs/libi40iw-rdmav2.so
7ff3752be000-7ff3752bf000 r--p 00006000 fd:00 252310532
  /usr/lib64/libibverbs/libi40iw-rdmav2.so
7ff3752bf000-7ff3752c0000 rw-p 00007000 fd:00 252310532
  /usr/lib64/libibverbs/libi40iw-rdmav2.so
7ff3752c0000-7ff3752c4000 r-xp 00000000 fd:00 252310533
  /usr/lib64/libibverbs/libipathverbs-rdmav2.so
7ff3752c4000-7ff3754c3000 ---p 00004000 fd:00 252310533
  /usr/lib64/libibverbs/libipathverbs-rdmav2.so
7ff3754c3000-7ff3754c4000 r--p 00003000 fd:00 252310533
  /usr/lib64/libibverbs/libipathverbs-rdmav2.so
7ff3754c4000-7ff3754c5000 rw-p 00004000 fd:00 252310533
  /usr/lib64/libibverbs/libipathverbs-rdmav2.so
7ff3754c5000-7ff3754cd000 r-xp 00000000 fd:00 252310534
  /usr/lib64/libibverbs/libmlx4-rdmav2.so
7ff3754cd000-7ff3756cc000 ---p 00008000 fd:00 252310534
  /usr/lib64/libibverbs/libmlx4-rdmav2.so
7ff3756ce000-7ff3756e5000 r-xp 00000000 fd:00 252310535
  /usr/lib64/libibverbs/libmlx5-rdmav2.so
7ff3756e5000-7ff3758e4000 ---p 00017000 fd:00 252310535
  /usr/lib64/libibverbs/libmlx5-rdmav2.so
7ff3758e4000-7ff3758e5000 r--p 00016000 fd:00 252310535
  /usr/lib64/libibverbs/libmlx5-rdmav2.so
7ff3758e5000-7ff3758e6000 rw-p 00017000 fd:00 252310535
  /usr/lib64/libibverbs/libmlx5-rdmav2.so
7ff3758e6000-7ff3758ee000 r-xp 00000000 fd:00 252310536
  /usr/lib64/libibverbs/libmthca-rdmav2.so
7ff3758ee000-7ff375aed000 ---p 00008000 fd:00 252310536
  /usr/lib64/libibverbs/libmthca-rdmav2.so
7ff375aed000-7ff375aee000 r--p 00007000 fd:00 252310536
  /usr/lib64/libibverbs/libmthca-rdmav2.so
7ff375aee000-7ff375aef000 rw-p 00008000 fd:00 252310536
  /usr/lib64/libibverbs/libmthca-rdmav2.so
7ff375aef000-7ff375af4000 r-xp 00000000 fd:00 252310537
  /usr/lib64/libibverbs/libnes-rdmav2.so
7ff375af4000-7ff375cf3000 ---p 00005000 fd:00 252310537
  /usr/lib64/libibverbs/libnes-rdmav2.so
7ff375cf3000-7ff375cf4000 r--p 00004000 fd:00 252310537
  /usr/lib64/libibverbs/libnes-rdmav2.so
7ff375cf4000-7ff375cf5000 rw-p 00005000 fd:00 252310537
  /usr/lib64/libibverbs/libnes-rdmav2.so
7ff375cf5000-7ff375cfb000 r-xp 00000000 fd:00 252310538
  /usr/lib64/libibverbs/libocrdma-rdmav2.so
7ff375cfb000-7ff375efa000 ---p 00006000 fd:00 252310538
  /usr/lib64/libibverbs/libocrdma-rdmav2.so
7ff375efa000-7ff375efb000 r--p 00005000 fd:00 252310538
  /usr/lib64/libibverbs/libocrdma-rdmav2.so
[calypso-rhel73GA:12532] *** Process received signal ***
[calypso-rhel73GA:12532] Signal: Aborted (6)
[calypso-rhel73GA:12532] Signal code:  (-6)
[calypso-rhel73GA:12532] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7ff37ad00370]
[calypso-rhel73GA:12532] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7ff37a9651d7]
[calypso-rhel73GA:12532] [ 2] /lib64/libc.so.6(abort+0x148)[0x7ff37a9668c8]
[calypso-rhel73GA:12532] [ 3] /lib64/libc.so.6(+0x74f07)[0x7ff37a9a4f07]
[calypso-rhel73GA:12532] [ 4] /lib64/libc.so.6(+0x7c503)[0x7ff37a9ac503]
[calypso-rhel73GA:12532] [ 5] /usr/local/mpi/openmpi/lib/libmpi.so.20(+0x58d17)[0x7ff37af65d17]
[calypso-rhel73GA:12532] [ 6] /usr/local/mpi/openmpi/lib/libmpi.so.20(ompi_mpi_errors_are_fatal_comm_handler+0x105)[0x7ff37af66485]
[calypso-rhel73GA:12532] [ 7] /usr/local/mpi/openmpi/lib/libmpi.so.20(ompi_errhandler_invoke+0x115)[0x7ff37af659c5]
[calypso-rhel73GA:12532] [ 8] /usr/local/mpi/openmpi/lib/libmpi.so.20(MPI_Bcast+0x1a3)[0x7ff37af86743]
[calypso-rhel73GA:12532] [ 9] /usr/local/imb/openmpi/dcheck/IMB-MPI1[0x402dd7]
[calypso-rhel73GA:12532] [10] /usr/local/imb/openmpi/dcheck/IMB-MPI1[0x401e0b]
[calypso-rhel73GA:12532] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff37a951b35]
[calypso-rhel73GA:12532] [12] /usr/local/imb/openmpi/dcheck/IMB-MPI1[0x402744]
[calypso-rhel73GA:12532] *** End of error message ***

These are the run-time parameters I used:

mpirun -np 6 -hostfile ./hostfile --mca btl openib,self,sm \
  --mca btl_openib_receive_queues P,65536,256,192,128 \
  -mca btl_openib_cpc_include rdmacm -mca pml ob1 --allow-run-as-root \
  --bind-to none --map-by node /usr/local/imb/openmpi/IMB-MPI1

These are the entries in the .ini file (for reference only):

vendor_id = 0x14e4
vendor_part_id = 0x16d7
use_eager_rdma = 1
mtu = 1024
receive_queues = P,65536,256,192,128
max_inline_data = 96
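
For readers unfamiliar with the `receive_queues` value used above, here is a small sketch that decodes a per-peer ("P") queue specification. The field meanings (buffer size in bytes, number of buffers, low watermark, credit window size) follow the openib BTL documentation as I understand it; `PerPeerQueue` and `parse_per_peer` are hypothetical names for illustration, and the sketch handles only the four fields that appear in this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PerPeerQueue:
    buffer_size: int                      # bytes per receive buffer
    num_buffers: Optional[int] = None     # buffers posted per peer
    low_watermark: Optional[int] = None   # repost when free buffers drop to this
    window_size: Optional[int] = None     # credit window


def parse_per_peer(spec: str) -> PerPeerQueue:
    """Decode a spec such as 'P,65536,256,192,128' (assumed field order)."""
    kind, *fields = spec.strip().split(",")
    if kind != "P":
        raise ValueError(f"not a per-peer queue spec: {spec!r}")
    if not 1 <= len(fields) <= 4:
        raise ValueError(f"expected 1 to 4 numeric fields, got {len(fields)}")
    return PerPeerQueue(*(int(f) for f in fields))


if __name__ == "__main__":
    print(parse_per_peer("P,65536,256,192,128"))
```

Read this way, the setting asks for 256 buffers of 64 KiB per peer, with a low watermark of 192 and a credit window of 128.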

jsquyres commented 7 years ago

This issue is presumably a follow-up to, and detailed justification for, #3434.