ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

'handle->state == RXM_CMAP_CONNREQ_SENT' assert failed when run IMB #6891

Closed oleotiger closed 2 years ago

oleotiger commented 3 years ago

I'm running Intel IMB over the verbs provider with a Huawei Hi1822 network card.

OS: CentOS 7.6 (3.10.0-957.el7.x86_64)
libfabric: 1.12.0
Intel MPI: 2021.2
Network card information: attached below (in the third comment)

I compiled libfabric with debug enabled to debug this problem.

export FI_PROVIDER_PATH=/opt/x86_64/x00513100/libs/libfabric-debug/lib/libfabric
export LD_LIBRARY_PATH=/opt/x86_64/x00513100/libs/libfabric-debug/lib:$LD_LIBRARY_PATH
export FI_LOG_LEVEL=debug

The command I ran:

mpirun -hostfile /root/host2 -ppn 1 -genv FI_PROVIDER=verbs /opt/x86_64/libs/compiler/intel/21.2.0/mpi/2021.2.0/bin/IMB-MPI1

After running this command, one process aborts with:

IMB-MPI1: prov/rxm/src/rxm_conn.c:510: rxm_cmap_process_connect: Assertion `handle->state == RXM_CMAP_CONNREQ_SENT' failed.

Why do I get this assertion failure, and how can I resolve it? Is the problem in libfabric, in the network card driver, or somewhere else?

Here is the warn-level log:

libfabric:24595:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_1: there are no active ports
libfabric:23352:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_1: there are no active ports
libfabric:24595:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_1: there are no active ports
libfabric:23352:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_1: there are no active ports
libfabric:24595:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_1: there are no active ports
libfabric:23352:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_1: there are no active ports
libfabric:24595:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_2: there are no active ports
libfabric:23352:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_2: there are no active ports
libfabric:24595:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_2: there are no active ports
libfabric:23352:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_2: there are no active ports
libfabric:24595:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_2: there are no active ports
libfabric:23352:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_2: there are no active ports
libfabric:24595:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_3: there are no active ports
libfabric:23352:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_3: there are no active ports
libfabric:24595:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_3: there are no active ports
libfabric:23352:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_3: there are no active ports
libfabric:24595:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_3: there are no active ports
libfabric:23352:verbs:fabric:vrb_get_device_attrs():614<warn> device hrn0_3: there are no active ports
libfabric:24595:psm3:core:psmx3_update_hfi_info():342<warn> NIC 1 STATE = INACTIVE
libfabric:23352:psm3:core:psmx3_update_hfi_info():342<warn> NIC 1 STATE = INACTIVE
libfabric:24595:psm3:core:psmx3_update_hfi_info():342<warn> NIC 2 STATE = INACTIVE
libfabric:23352:psm3:core:psmx3_update_hfi_info():342<warn> NIC 2 STATE = INACTIVE
libfabric:24595:psm3:core:psmx3_update_hfi_info():342<warn> NIC 3 STATE = INACTIVE
libfabric:23352:psm3:core:psmx3_update_hfi_info():342<warn> NIC 3 STATE = INACTIVE
libfabric:24595:ofi_rxm:av:util_verify_av_attr():504<warn> Shared AV is unsupported
libfabric:23352:ofi_rxm:av:util_verify_av_attr():504<warn> Shared AV is unsupported
#----------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2021.2, MPI-1 part
#----------------------------------------------------------------
# Date                  : Mon Jun 28 22:02:29 2021
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-957.el7.x86_64
# Version               : #1 SMP Thu Nov 8 23:39:32 UTC 2018
# MPI Version           : 3.1
# MPI Thread Environment:

# Calling sequence was:

# /opt/x86_64/z00603408/libs/compiler/intel/21.2.0/mpi/2021.2.0/bin/IMB-MPI1

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_local
# Reduce_scatter
# Reduce_scatter_block
# Allgather
# Allgatherv
# Gather
# Gatherv
# Scatter
# Scatterv
# Alltoall
# Alltoallv
# Bcast
# Barrier
libfabric:23352:verbs:eq:vrb_set_rnr_timer():475<warn> Unable to modify QP attribute
IMB-MPI1: prov/rxm/src/rxm_conn.c:510: rxm_cmap_process_connect: Assertion `handle->state == RXM_CMAP_CONNREQ_SENT' failed.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 23352 RUNNING AT 6248-node77
=   KILLED BY SIGNAL: 6 (Aborted)
===================================================================================
oleotiger commented 3 years ago

The detailed debug-level log (last 100 lines):

libfabric:21098:sockets:core:ofi_get_list_of_addr():1407<info> Available addr: 150.1.68.77, iface name: enp61s0, speed: 25000
libfabric:21098:sockets:core:ofi_get_list_of_addr():1407<info> Available addr: 16.16.1.77, iface name: enp177s0, speed: 25000
libfabric:79126:sockets:core:ofi_get_list_of_addr():1407<info> Available addr: fe80::e83b:4ed6:f5:fc1f, iface name: enp177s0, speed: 25000
libfabric:21098:sockets:core:ofi_get_list_of_addr():1407<info> Available addr: fe80::7e9c:ab7f:3930:d669, iface name: enp61s0, speed: 25000
libfabric:79126:sockets:core:ofi_get_list_of_addr():1407<info> Available addr: fe80::3b41:4688:d197:9e38, iface name: enp179s0, speed: 25000
libfabric:79126:sockets:core:ofi_get_list_of_addr():1407<info> Available addr: fe80::c7a8:9fbf:1f6e:ce90, iface name: enp181s0, speed: 25000
libfabric:79126:sockets:core:ofi_insert_loopback_addr():1250<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:79126:sockets:core:ofi_insert_loopback_addr():1265<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:79126:sockets:core:util_getinfo_ifs():318<info> Chosen addr for using: 150.1.68.79, speed 25000
libfabric:79126:core:core:fi_getinfo_():1033<debug> fi_getinfo: provider sockets returned success
libfabric:79126:ofi_mrail:fabric:mrail_get_core_info():289<info> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:21098:sockets:core:ofi_get_list_of_addr():1407<info> Available addr: fe80::e2e9:6e36:8dfb:a465, iface name: enp177s0, speed: 25000
libfabric:79126:core:core:fi_getinfo_():1021<info> fi_getinfo: provider ofi_mrail returned -61 (No data available)
libfabric:79126:core:core:fi_getinfo_():987<debug> hints prov_name: verbs;^ofi_rxm
libfabric:79126:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #1 hrn0_0
libfabric:79126:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: hrn0_0
libfabric:79126:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #2 hrn0_0-xrc
libfabric:79126:verbs:fabric:vrb_get_matching_info():1531<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:79126:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #3 hrn0_0-dgram
libfabric:79126:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: hrn0_0-dgram
libfabric:79126:core:core:fi_getinfo_():1033<debug> fi_getinfo: provider verbs returned success
libfabric:79126:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_rxd
libfabric:79126:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_mrail
libfabric:79126:core:core:fi_fabric_():1220<info> Opened fabric: IB-0xfe80000000000000
libfabric:79126:core:core:fi_fabric_():1220<info> Opened fabric: IB-0xfe80000000000000
libfabric:79126:ofi_rxm:core:fi_param_get_():299<info> read bool var use_srx=0
libfabric:79126:ofi_rxm:core:ofi_get_core_info():296<debug> --- Begin ofi_get_core_info ---
libfabric:79126:core:core:fi_getinfo_():987<debug> hints prov_name: verbs;^ofi_rxm
libfabric:79126:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #1 hrn0_0
libfabric:79126:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: hrn0_0
libfabric:79126:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #2 hrn0_0-xrc
libfabric:79126:verbs:core:vrb_check_hints():264<info> skipping device hrn0_0-xrc (want hrn0_0)
libfabric:79126:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #3 hrn0_0-dgram
libfabric:79126:verbs:core:ofi_check_ep_type():658<info> unsupported endpoint type
libfabric:79126:verbs:core:ofi_check_ep_type():659<info> Supported: FI_EP_DGRAM
libfabric:21098:sockets:core:ofi_get_list_of_addr():1407<info> Available addr: fe80::e64a:b102:617e:bee9, iface name: enp179s0, speed: 25000
libfabric:21098:sockets:core:ofi_insert_loopback_addr():1250<info> available addr: : fi_sockaddr_in://127.0.0.1:0
libfabric:79126:verbs:core:ofi_check_ep_type():659<info> Requested: FI_EP_MSG
libfabric:21098:sockets:core:ofi_insert_loopback_addr():1265<info> available addr: : fi_sockaddr_in6://[::1]:0
libfabric:21098:sockets:core:util_getinfo_ifs():318<info> Chosen addr for using: 150.1.68.77, speed 25000
libfabric:79126:core:core:fi_getinfo_():1033<debug> fi_getinfo: provider verbs returned success
libfabric:79126:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_rxd
libfabric:21098:core:core:fi_getinfo_():1033<debug> fi_getinfo: provider sockets returned success
libfabric:21098:ofi_mrail:fabric:mrail_get_core_info():289<info> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:79126:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_mrail
libfabric:79126:ofi_rxm:core:ofi_get_core_info():301<debug> --- End ofi_get_core_info ---
libfabric:21098:core:core:fi_getinfo_():1021<info> fi_getinfo: provider ofi_mrail returned -61 (No data available)
libfabric:79126:core:mr:ofi_intercept_symbol():350<debug> overwriting function mmap
libfabric:21098:core:core:fi_getinfo_():987<debug> hints prov_name: verbs;^ofi_rxm
libfabric:21098:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #1 hrn0_0
libfabric:79126:core:mr:ofi_intercept_symbol():350<debug> overwriting function munmap
libfabric:21098:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: hrn0_0
libfabric:21098:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #2 hrn0_0-xrc
libfabric:21098:verbs:fabric:vrb_get_matching_info():1531<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:79126:core:mr:ofi_intercept_symbol():350<debug> overwriting function mremap
libfabric:21098:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #3 hrn0_0-dgram
libfabric:79126:core:mr:ofi_intercept_symbol():350<debug> overwriting function madvise
libfabric:21098:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: hrn0_0-dgram
libfabric:79126:core:mr:ofi_intercept_symbol():350<debug> overwriting function shmat
libfabric:21098:core:core:fi_getinfo_():1033<debug> fi_getinfo: provider verbs returned success
libfabric:21098:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_rxd
libfabric:79126:core:mr:ofi_intercept_symbol():350<debug> overwriting function shmdt
libfabric:21098:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_mrail
libfabric:79126:core:mr:ofi_intercept_symbol():350<debug> overwriting function brk
libfabric:21098:core:core:fi_fabric_():1220<info> Opened fabric: IB-0xfe80000000000000
libfabric:79126:core:mr:ofi_monitors_add_cache():215<debug> MR cache disabled for FI_HMEM_ZE memory
libfabric:21098:core:core:fi_fabric_():1220<info> Opened fabric: IB-0xfe80000000000000
libfabric:21098:ofi_rxm:core:fi_param_get_():299<info> read bool var use_srx=0
libfabric:21098:ofi_rxm:core:ofi_get_core_info():296<debug> --- Begin ofi_get_core_info ---
libfabric:79126:verbs:mr:vrb_domain():348<info> MR cache enabled for FI_HMEM_SYSTEM memory
libfabric:21098:core:core:fi_getinfo_():987<debug> hints prov_name: verbs;^ofi_rxm
libfabric:79126:ofi_rxm:core:fi_param_get_():280<info> variable enable_dyn_rbuf=<not set>
libfabric:21098:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #1 hrn0_0
libfabric:79126:ofi_rxm:av:util_verify_av_attr():504<warn> Shared AV is unsupported
libfabric:21098:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: hrn0_0
libfabric:21098:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #2 hrn0_0-xrc
libfabric:79126:ofi_rxm:av:util_av_init():475<info> AV size 2
libfabric:21098:verbs:core:vrb_check_hints():264<info> skipping device hrn0_0-xrc (want hrn0_0)
libfabric:21098:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #3 hrn0_0-dgram
libfabric:21098:verbs:core:ofi_check_ep_type():658<info> unsupported endpoint type
libfabric:79126:ofi_rxm:core:fi_param_get_():280<info> variable comp_per_progress=<not set>
libfabric:21098:verbs:core:ofi_check_ep_type():659<info> Supported: FI_EP_DGRAM
libfabric:79126:ofi_rxm:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:79126:ofi_rxm:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:21098:verbs:core:ofi_check_ep_type():659<info> Requested: FI_EP_MSG
libfabric:79126:ofi_rxm:core:ofi_check_fabric_attr():404<info> Requesting provider tcp, skipping verbs;ofi_rxm
libfabric:79126:ofi_rxm:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:21098:core:core:fi_getinfo_():1033<debug> fi_getinfo: provider verbs returned success
libfabric:21098:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_rxd
libfabric:79126:ofi_rxm:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:79126:ofi_rxm:core:ofi_check_ep_attr():770<info> Tag size exceeds supported size
libfabric:21098:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_mrail
libfabric:21098:ofi_rxm:core:ofi_get_core_info():301<debug> --- End ofi_get_core_info ---
libfabric:79126:ofi_rxm:core:ofi_check_ep_attr():771<info> Supported: 6148914691236517205
libfabric:79126:ofi_rxm:core:ofi_check_ep_attr():771<info> Requested: -6148914691236517206
libfabric:21098:core:mr:ofi_intercept_symbol():350<debug> overwriting function mmap
libfabric:79126:ofi_rxm:core:fi_param_get_():299<info> read bool var use_srx=0
libfabric:21098:core:mr:ofi_intercept_symbol():350<debug> overwriting function munmap
libfabric:79126:ofi_rxm:core:ofi_get_core_info():296<debug> --- Begin ofi_get_core_info ---
libfabric:79126:core:core:fi_getinfo_():987<debug> hints prov_name: verbs;^ofi_rxm
libfabric:21098:core:mr:ofi_intercept_symbol():350<debug> overwriting function mremap
libfabric:79126:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #1 hrn0_0
libfabric:21098:core:mr:ofi_intercept_symbol():350<debug> overwriting function madvise
libfabric:79126:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: hrn0_0
libfabric:79126:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #2 hrn0_0-xrc
libfabric:21098:core:mr:ofi_intercept_symbol():350<debug> overwriting function shmat
libfabric:79126:verbs:core:vrb_check_hints():264<info> skipping device hrn0_0-xrc (want hrn0_0)
libfabric:79126:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #3 hrn0_0-dgram
libfabric:21098:core:mr:ofi_intercept_symbol():350<debug> overwriting function shmdt
libfabric:79126:verbs:core:ofi_check_ep_type():658<info> unsupported endpoint type
libfabric:79126:verbs:core:ofi_check_ep_type():659<info> Supported: FI_EP_DGRAM
libfabric:21098:core:mr:ofi_intercept_symbol():350<debug> overwriting function brk
libfabric:79126:verbs:core:ofi_check_ep_type():659<info> Requested: FI_EP_MSG
libfabric:21098:core:mr:ofi_monitors_add_cache():215<debug> MR cache disabled for FI_HMEM_ZE memory
libfabric:79126:core:core:fi_getinfo_():1033<debug> fi_getinfo: provider verbs returned success
libfabric:79126:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_rxd
libfabric:21098:verbs:mr:vrb_domain():348<info> MR cache enabled for FI_HMEM_SYSTEM memory
libfabric:79126:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_mrail
libfabric:79126:ofi_rxm:core:ofi_get_core_info():301<debug> --- End ofi_get_core_info ---
libfabric:21098:ofi_rxm:core:fi_param_get_():280<info> variable enable_dyn_rbuf=<not set>
libfabric:79126:ofi_rxm:core:fi_param_get_():280<info> variable sar_limit=<not set>
libfabric:21098:ofi_rxm:av:util_verify_av_attr():504<warn> Shared AV is unsupported
libfabric:79126:ofi_rxm:core:rxm_ep_settings_init():2741<info> Settings:
                 MR local: MSG - 1, RxM - 0
                 Completions per progress: MSG - 1
                 Buffered min: 24
                 Min multi recv size: 16384
                 FI_EP_MSG provider inject size: 192
                 rxm inject size: 16384
                 Protocol limits: Eager: 16384, SAR: 131072
libfabric:21098:ofi_rxm:av:util_av_init():475<info> AV size 2
libfabric:79126:ofi_rxm:ep_ctrl:ofi_wait_add_fid():843<debug> Given fid (0xd54100) already added to wait list - 0xd4d7d0
libfabric:79126:ofi_rxm:ep_ctrl:ofi_wait_add_fid():843<debug> Given fid (0xd4ff20) already added to wait list - 0xd4d7d0
libfabric:21098:ofi_rxm:core:fi_param_get_():280<info> variable comp_per_progress=<not set>
libfabric:79126:verbs:ep_ctrl:vrb_pep_listen():518<info> listening on: fi_sockaddr_in://150.1.68.79:51116
libfabric:21098:ofi_rxm:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:79126:ofi_rxm:ep_ctrl:rxm_conn_cmap_alloc():1577<debug> local_name: fi_sockaddr_in://150.1.68.79:51116
libfabric:21098:ofi_rxm:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:21098:ofi_rxm:core:ofi_check_fabric_attr():404<info> Requesting provider tcp, skipping verbs;ofi_rxm
libfabric:21098:ofi_rxm:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:21098:ofi_rxm:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:21098:ofi_rxm:core:ofi_check_ep_attr():770<info> Tag size exceeds supported size
libfabric:21098:ofi_rxm:core:ofi_check_ep_attr():771<info> Supported: 6148914691236517205
libfabric:21098:ofi_rxm:core:ofi_check_ep_attr():771<info> Requested: -6148914691236517206
libfabric:21098:ofi_rxm:core:fi_param_get_():299<info> read bool var use_srx=0
libfabric:21098:ofi_rxm:core:ofi_get_core_info():296<debug> --- Begin ofi_get_core_info ---
libfabric:21098:core:core:fi_getinfo_():987<debug> hints prov_name: verbs;^ofi_rxm
libfabric:21098:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #1 hrn0_0
libfabric:21098:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: hrn0_0
libfabric:21098:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #2 hrn0_0-xrc
libfabric:21098:verbs:core:vrb_check_hints():264<info> skipping device hrn0_0-xrc (want hrn0_0)
libfabric:21098:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #3 hrn0_0-dgram
libfabric:21098:verbs:core:ofi_check_ep_type():658<info> unsupported endpoint type
libfabric:21098:verbs:core:ofi_check_ep_type():659<info> Supported: FI_EP_DGRAM
libfabric:21098:verbs:core:ofi_check_ep_type():659<info> Requested: FI_EP_MSG
libfabric:21098:core:core:fi_getinfo_():1033<debug> fi_getinfo: provider verbs returned success
libfabric:21098:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_rxd
libfabric:21098:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_mrail
libfabric:21098:ofi_rxm:core:ofi_get_core_info():301<debug> --- End ofi_get_core_info ---
libfabric:21098:ofi_rxm:core:fi_param_get_():280<info> variable sar_limit=<not set>
libfabric:21098:ofi_rxm:core:rxm_ep_settings_init():2741<info> Settings:
                 MR local: MSG - 1, RxM - 0
                 Completions per progress: MSG - 1
                 Buffered min: 24
                 Min multi recv size: 16384
                 FI_EP_MSG provider inject size: 192
                 rxm inject size: 16384
                 Protocol limits: Eager: 16384, SAR: 131072
libfabric:21098:ofi_rxm:ep_ctrl:ofi_wait_add_fid():843<debug> Given fid (0x897a60) already added to wait list - 0x89a0a0
libfabric:21098:ofi_rxm:ep_ctrl:ofi_wait_add_fid():843<debug> Given fid (0x8a0d90) already added to wait list - 0x89a0a0
libfabric:21098:verbs:ep_ctrl:vrb_pep_listen():518<info> listening on: fi_sockaddr_in://150.1.68.77:48001
libfabric:21098:ofi_rxm:ep_ctrl:rxm_conn_cmap_alloc():1577<debug> local_name: fi_sockaddr_in://150.1.68.77:48001
libfabric:79126:ofi_rxm:av:ofi_ip_av_insertv():628<debug> inserting 2 addresses
libfabric:79126:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert addr: fi_sockaddr_in://150.1.68.77:48001
libfabric:21098:ofi_rxm:av:ofi_ip_av_insertv():628<debug> inserting 2 addresses
libfabric:79126:ofi_rxm:av:ip_av_insert_addr():615<debug> av_insert fi_addr: 0
libfabric:21098:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert addr: fi_sockaddr_in://150.1.68.77:48001
libfabric:21098:ofi_rxm:av:ip_av_insert_addr():615<debug> av_insert fi_addr: 0
libfabric:79126:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert addr: fi_sockaddr_in://150.1.68.79:51116
libfabric:79126:ofi_rxm:av:ip_av_insert_addr():615<debug> av_insert fi_addr: 1
libfabric:21098:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert addr: fi_sockaddr_in://150.1.68.79:51116
libfabric:21098:ofi_rxm:av:ip_av_insert_addr():615<debug> av_insert fi_addr: 1
libfabric:79126:ofi_rxm:av:ofi_ip_av_insertv():645<debug> 2 addresses successful
libfabric:21098:ofi_rxm:av:ofi_ip_av_insertv():645<debug> 2 addresses successful
libfabric:79126:ofi_rxm:ep_ctrl:rxm_cmap_alloc_handle():355<debug> Allocated handle: 0xd51fc0 for fi_addr: 0
libfabric:79126:ofi_rxm:ep_ctrl:rxm_cmap_init_handle():140<debug> [CM] handle: 0xd51fc0 RXM_CMAP_IDLE -> RXM_CMAP_IDLE
libfabric:21098:ofi_rxm:ep_ctrl:rxm_cmap_alloc_handle():355<debug> Allocated handle: 0x89e2f0 for fi_addr: 0
libfabric:21098:ofi_rxm:ep_ctrl:rxm_cmap_init_handle():140<debug> [CM] handle: 0x89e2f0 RXM_CMAP_IDLE -> RXM_CMAP_IDLE
libfabric:79126:ofi_rxm:ep_ctrl:rxm_cmap_alloc_handle():355<debug> Allocated handle: 0xd542b0 for fi_addr: 1
libfabric:79126:ofi_rxm:ep_ctrl:rxm_cmap_init_handle():140<debug> [CM] handle: 0xd542b0 RXM_CMAP_IDLE -> RXM_CMAP_IDLE
libfabric:21098:ofi_rxm:ep_ctrl:rxm_cmap_alloc_handle():355<debug> Allocated handle: 0x8a0110 for fi_addr: 1
libfabric:21098:ofi_rxm:ep_ctrl:rxm_cmap_init_handle():140<debug> [CM] handle: 0x8a0110 RXM_CMAP_IDLE -> RXM_CMAP_IDLE
libfabric:79126:ofi_rxm:core:rxm_ep_setopt():652<info> FI_OPT_MIN_MULTI_RECV set to 16384
libfabric:21098:ofi_rxm:core:rxm_ep_setopt():652<info> FI_OPT_MIN_MULTI_RECV set to 16384
#----------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2021.2, MPI-1 part
#----------------------------------------------------------------
# Date                  : Mon Jun 28 21:51:30 2021
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-957.el7.x86_64
# Version               : #1 SMP Thu Nov 8 23:39:32 UTC 2018
# MPI Version           : 3.1
# MPI Thread Environment:

# Calling sequence was:

# /opt/x86_64/z00603408/libs/compiler/intel/21.2.0/mpi/2021.2.0/bin/IMB-MPI1

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_local
# Reduce_scatter
# Reduce_scatter_block
# Allgather
# Allgatherv
# Gather
# Gatherv
# Scatter
# Scatterv
# Alltoall
# Alltoallv
# Bcast
# Barrier
libfabric:21098:ofi_rxm:ep_ctrl:rxm_cmap_connect():686<debug> initiating MSG_EP connect for fi_addr: 1
libfabric:21098:verbs:fabric:vrb_open_ep():1063<debug> open_ep src addr: fi_sockaddr_in://150.1.68.77:0
libfabric:21098:verbs:fabric:vrb_open_ep():1066<debug> open_ep dest addr: fi_sockaddr_in://150.1.68.79:51116
libfabric:21098:verbs:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:21098:verbs:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:21098:verbs:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:21098:verbs:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:21098:verbs:core:ofi_check_ep_attr():680<info> Unsupported protocol
libfabric:21098:verbs:core:ofi_check_ep_attr():681<info> Supported: FI_PROTO_RDMA_CM_IB_XRC
libfabric:21098:verbs:core:ofi_check_ep_attr():681<info> Requested: FI_PROTO_RDMA_CM_IB_RC
libfabric:21098:verbs:core:ofi_check_ep_type():658<info> unsupported endpoint type
libfabric:21098:verbs:core:ofi_check_ep_type():659<info> Supported: FI_EP_DGRAM
libfabric:21098:verbs:core:ofi_check_ep_type():659<info> Requested: FI_EP_MSG
libfabric:79126:ofi_rxm:ep_ctrl:rxm_cmap_connect():686<debug> initiating MSG_EP connect for fi_addr: 0
libfabric:79126:verbs:fabric:vrb_open_ep():1063<debug> open_ep src addr: fi_sockaddr_in://150.1.68.79:0
libfabric:79126:verbs:fabric:vrb_open_ep():1066<debug> open_ep dest addr: fi_sockaddr_in://150.1.68.77:48001
libfabric:79126:verbs:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:79126:verbs:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:79126:verbs:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:79126:verbs:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:79126:verbs:core:ofi_check_ep_attr():680<info> Unsupported protocol
libfabric:79126:verbs:core:ofi_check_ep_attr():681<info> Supported: FI_PROTO_RDMA_CM_IB_XRC
libfabric:79126:verbs:core:ofi_check_ep_attr():681<info> Requested: FI_PROTO_RDMA_CM_IB_RC
libfabric:79126:verbs:core:ofi_check_ep_type():658<info> unsupported endpoint type
libfabric:79126:verbs:core:ofi_check_ep_type():659<info> Supported: FI_EP_DGRAM
libfabric:79126:verbs:core:ofi_check_ep_type():659<info> Requested: FI_EP_MSG
libfabric:21098:verbs:mr:ofi_mr_cache_reg():427<debug> reg 0x2b582f343170 (len: 18857632)
libfabric:21098:verbs:mr:ofi_mr_cache_reg():427<debug> reg 0x2b582953f0c0 (len: 180224)
libfabric:79126:verbs:mr:ofi_mr_cache_reg():427<debug> reg 0x2ad217f4c170 (len: 18857632)
libfabric:79126:verbs:mr:ofi_mr_cache_reg():427<debug> reg 0x2ad2121460c0 (len: 180224)
libfabric:79126:ofi_rxm:ep_ctrl:rxm_cmap_connect():695<debug> [CM] handle: 0xd51fc0 RXM_CMAP_IDLE -> RXM_CMAP_CONNREQ_SENT
libfabric:21098:ofi_rxm:ep_ctrl:rxm_cmap_connect():695<debug> [CM] handle: 0x8a0110 RXM_CMAP_IDLE -> RXM_CMAP_CONNREQ_SENT
libfabric:21098:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #1 hrn0_0
libfabric:21098:verbs:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:21098:verbs:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:21098:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: hrn0_0
libfabric:21098:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #2 hrn0_0-xrc
libfabric:21098:verbs:core:vrb_check_hints():264<info> skipping device hrn0_0-xrc (want hrn0_0)
libfabric:21098:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #3 hrn0_0-dgram
libfabric:21098:verbs:core:ofi_check_ep_type():658<info> unsupported endpoint type
libfabric:21098:verbs:core:ofi_check_ep_type():659<info> Supported: FI_EP_DGRAM
libfabric:21098:verbs:core:ofi_check_ep_type():659<info> Requested: FI_EP_MSG
libfabric:21098:verbs:eq:vrb_eq_cm_getinfo():241<debug> src: fi_sockaddr_in://150.1.68.77:48001
libfabric:21098:verbs:eq:vrb_eq_cm_getinfo():242<debug> dst: fi_sockaddr_in://150.1.68.79:37529
libfabric:21098:ofi_rxm:ep_ctrl:rxm_conn_handle_event():1257<debug> Got new connection
libfabric:21098:ofi_rxm:ep_ctrl:rxm_cmap_process_connreq():569<debug> Processing connreq from remote pep: fi_sockaddr_in://150.1.68.79:51116
libfabric:21098:ofi_rxm:ep_ctrl:rxm_cmap_process_connreq():597<debug> local_name: fi_sockaddr_in://150.1.68.77:48001
libfabric:21098:ofi_rxm:ep_ctrl:rxm_cmap_process_connreq():599<debug> remote_name: fi_sockaddr_in://150.1.68.79:51116
libfabric:21098:ofi_rxm:ep_ctrl:rxm_cmap_process_connreq():612<debug> Re-using handle: 0x8a0110 to accept remote connection
libfabric:21098:ofi_rxm:ep_ctrl:rxm_conn_close():318<debug> cancelled deferred message
libfabric:21098:ofi_rxm:ep_ctrl:rxm_conn_close():324<debug> closing msg ep
libfabric:21098:verbs:domain:vrb_ep_close():501<info> EP 0x8cccd0 is being closed
libfabric:21098:ofi_rxm:ep_ctrl:rxm_cmap_process_connreq():629<debug> [CM] handle: 0x8a0110 RXM_CMAP_CONNREQ_SENT -> RXM_CMAP_CONNREQ_RECV
libfabric:21098:verbs:fabric:vrb_open_ep():1063<debug> open_ep src addr: fi_sockaddr_in://150.1.68.77:48001
libfabric:21098:verbs:fabric:vrb_open_ep():1066<debug> open_ep dest addr: fi_sockaddr_in://150.1.68.79:37529
libfabric:21098:verbs:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:21098:verbs:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:21098:verbs:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:21098:verbs:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:21098:verbs:core:ofi_check_ep_attr():680<info> Unsupported protocol
libfabric:21098:verbs:core:ofi_check_ep_attr():681<info> Supported: FI_PROTO_RDMA_CM_IB_XRC
libfabric:79126:verbs:ep_ctrl:vrb_dbg_query_qp_attr():445<debug> QP attributes: min_rnr_timer: 12, timeout: 19, retry_cnt: 7, rnr_retry: 7
libfabric:21098:verbs:core:ofi_check_ep_attr():681<info> Requested: FI_PROTO_RDMA_CM_IB_RC
libfabric:21098:verbs:core:ofi_check_ep_type():658<info> unsupported endpoint type
libfabric:79126:ofi_rxm:ep_ctrl:rxm_conn_handle_event():1274<debug> connection successful
libfabric:21098:verbs:core:ofi_check_ep_type():659<info> Supported: FI_EP_DGRAM
libfabric:21098:verbs:core:ofi_check_ep_type():659<info> Requested: FI_EP_MSG
libfabric:79126:ofi_rxm:ep_ctrl:rxm_cmap_process_connect():508<debug> processing FI_CONNECTED event for handle: 0xd51fc0
libfabric:79126:ofi_rxm:ep_ctrl:rxm_cmap_process_connect():516<debug> [CM] handle: 0xd51fc0 RXM_CMAP_CONNREQ_SENT -> RXM_CMAP_CONNECTED
libfabric:21098:verbs:eq:vrb_set_rnr_timer():475<warn> Unable to modify QP attribute
libfabric:21098:ofi_rxm:ep_ctrl:rxm_conn_handle_event():1274<debug> connection successful
libfabric:21098:ofi_rxm:ep_ctrl:rxm_cmap_process_connect():508<debug> processing FI_CONNECTED event for handle: 0x89b870
IMB-MPI1: prov/rxm/src/rxm_conn.c:510: rxm_cmap_process_connect: Assertion `handle->state == RXM_CMAP_CONNREQ_SENT' failed.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 21098 RUNNING AT 6248-node77
=   KILLED BY SIGNAL: 6 (Aborted)
===================================================================================
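
The connection-manager lines are hard to follow in the interleaved output of the two ranks, so here is a minimal sketch of a filter that pulls them out. It assumes only the two line formats visible in the log above (`[CM] handle: ... A -> B` and `processing FI_CONNECTED event for handle: ...`). The function name `trace_cm_states` is made up for this example and is not part of libfabric.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   libfabric:21098:...:rxm_cmap_connect():695<debug> [CM] handle: 0x8a0110 RXM_CMAP_IDLE -> RXM_CMAP_CONNREQ_SENT
TRANSITION = re.compile(
    r"^libfabric:(?P<pid>\d+):.*\[CM\] handle: (?P<handle>0x[0-9a-f]+) "
    r"(?P<old>RXM_CMAP_\w+) -> (?P<new>RXM_CMAP_\w+)"
)

# Matches lines such as:
#   libfabric:21098:...:rxm_cmap_process_connect():508<debug> processing FI_CONNECTED event for handle: 0x89b870
CONNECTED = re.compile(
    r"^libfabric:(?P<pid>\d+):.*processing FI_CONNECTED event for handle: "
    r"(?P<handle>0x[0-9a-f]+)"
)


def trace_cm_states(lines):
    """Collect the state history of each (pid, handle) pair.

    Returns (history, unknown). `history` maps (pid, handle) to the list of
    states that handle went through. `unknown` lists the (pid, handle) pairs
    that received FI_CONNECTED without any logged state transition.
    """
    history = defaultdict(list)
    unknown = []
    for line in lines:
        m = TRANSITION.match(line)
        if m:
            key = (m["pid"], m["handle"])
            states = history[key]
            if not states:
                states.append(m["old"])
            # Skip self-transitions such as IDLE -> IDLE.
            if m["new"] != states[-1]:
                states.append(m["new"])
            continue
        m = CONNECTED.match(line)
        if m:
            key = (m["pid"], m["handle"])
            if key not in history:
                unknown.append(key)
    return dict(history), unknown
```

To use it, save the debug output to a file and pass the file's lines to `trace_cm_states`. On the log above, the `unknown` list should point at the handle named in the last FI_CONNECTED line before the abort, since that address never appears in an allocation or transition line for PID 21098.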
oleotiger commented 3 years ago

Network card information:

# ibv_devinfo -v
hca_id: hrn0_0
        transport:                      InfiniBand (0)
        fw_ver:                         3.5.0.3
        node_guid:                      de21:e2ff:fe30:cfbb
        sys_image_guid:                 de21:e2ff:fe30:cfbb
        vendor_id:                      0x19e5
        vendor_part_id:                 6178
        hw_ver:                         0x0
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0x557000
        max_qp:                         860158
        max_qp_wr:                      8191
        device_cap_flags:               0x01321400
                                        PORT_ACTIVE_EVENT
                                        RC_RNR_NAK_GEN
                                        MEM_WINDOW
                                        XRC
                                        MEM_MGT_EXTENSIONS
                                        MEM_WINDOW_TYPE_2B
        max_sge:                        8
        max_sge_rd:                     0
        max_cq:                         860160
        max_cqe:                        65535
        max_mr:                         130944
        max_pd:                         131072
        max_qp_rd_atom:                 128
        max_ee_rd_atom:                 0
        max_res_rd_atom:                110100224
        max_qp_init_rd_atom:            128
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         130944
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  0
        max_mcast_qp_attach:            0
        max_total_mcast_qp_attach:      0
        max_ah:                         2147483647
        max_fmr:                        255
        max_map_per_fmr:                255
        max_srq:                        65536
        max_srq_wr:                     16383
        max_srq_sge:                    7
        max_pkeys:                      1
        local_ca_ack_delay:             15
        general_odp_caps:
        rc_odp_caps:
                                        NO SUPPORT
        uc_odp_caps:
                                        NO SUPPORT
        ud_odp_caps:
                                        NO SUPPORT
        completion_timestamp_mask not supported
        core clock not supported
        device_cap_flags_ex:            0x0
        tso_caps:
        max_tso:                        0
        rss_caps:
                max_rwq_indirection_tables:                     0
                max_rwq_indirection_table_size:                 0
                rx_hash_function:                               0x0
                rx_hash_fields_mask:                            0x0
        max_wq_type_rq:                 0
        packet_pacing_caps:
                qp_rate_limit_min:      0kbps
                qp_rate_limit_max:      0kbps
        tag matching not supported
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
                        max_msg_sz:             0x80000000
                        port_cap_flags:         0x00010000
                        max_vl_num:             invalid value (0)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           1
                        gid_tbl_len:            16
                        subnet_timeout:         0
                        init_type_reply:        0
                        active_width:           1X (1)
                        active_speed:           25.0 Gbps (32)
                        phys_state:             LINK_UP (5)
                        GID[  0]:               fe80:0000:0000:0000:de21:e2ff:fe30:cfbb
                        GID[  1]:               fe80:0000:0000:0000:de21:e2ff:fe30:cfbb
                        GID[  2]:               0000:0000:0000:0000:0000:ffff:9601:444f
                        GID[  3]:               0000:0000:0000:0000:0000:ffff:9601:444f

hca_id: hrn0_1
        transport:                      InfiniBand (0)
        fw_ver:                         3.5.0.3
        node_guid:                      de21:e2ff:fe30:cfbc
        sys_image_guid:                 de21:e2ff:fe30:cfbc
        vendor_id:                      0x19e5
        vendor_part_id:                 6178
        hw_ver:                         0x0
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0x557000
        max_qp:                         20478
        max_qp_wr:                      8191
        device_cap_flags:               0x01321400
                                        PORT_ACTIVE_EVENT
                                        RC_RNR_NAK_GEN
                                        MEM_WINDOW
                                        XRC
                                        MEM_MGT_EXTENSIONS
                                        MEM_WINDOW_TYPE_2B
        max_sge:                        8
        max_sge_rd:                     0
        max_cq:                         20480
        max_cqe:                        65535
        max_mr:                         32640
        max_pd:                         131072
        max_qp_rd_atom:                 128
        max_ee_rd_atom:                 0
        max_res_rd_atom:                2621184
        max_qp_init_rd_atom:            128
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         32640
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  0
        max_mcast_qp_attach:            0
        max_total_mcast_qp_attach:      0
        max_ah:                         2147483647
        max_fmr:                        255
        max_map_per_fmr:                255
        max_srq:                        16384
        max_srq_wr:                     16383
        max_srq_sge:                    7
        max_pkeys:                      1
        local_ca_ack_delay:             15
        general_odp_caps:
        rc_odp_caps:
                                        NO SUPPORT
        uc_odp_caps:
                                        NO SUPPORT
        ud_odp_caps:
                                        NO SUPPORT
        completion_timestamp_mask not supported
        core clock not supported
        device_cap_flags_ex:            0x0
        tso_caps:
        max_tso:                        0
        rss_caps:
                max_rwq_indirection_tables:                     0
                max_rwq_indirection_table_size:                 0
                rx_hash_function:                               0x0
                rx_hash_fields_mask:                            0x0
        max_wq_type_rq:                 0
        packet_pacing_caps:
                qp_rate_limit_min:      0kbps
                qp_rate_limit_max:      0kbps
        tag matching not supported
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
                        max_msg_sz:             0x80000000
                        port_cap_flags:         0x00010000
                        max_vl_num:             invalid value (0)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           1
                        gid_tbl_len:            16
                        subnet_timeout:         0
                        init_type_reply:        0
                        active_width:           invalid widthX (0)
                        active_speed:           invalid speed (0)
                        phys_state:             DISABLED (3)
                        GID[  0]:               fe80:0000:0000:0000:de21:e2ff:fe30:cfbc
                        GID[  1]:               fe80:0000:0000:0000:de21:e2ff:fe30:cfbc

hca_id: hrn0_2
        transport:                      InfiniBand (0)
        fw_ver:                         3.5.0.3
        node_guid:                      de21:e2ff:fe30:cfbd
        sys_image_guid:                 de21:e2ff:fe30:cfbd
        vendor_id:                      0x19e5
        vendor_part_id:                 6178
        hw_ver:                         0x0
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0x557000
        max_qp:                         20478
        max_qp_wr:                      8191
        device_cap_flags:               0x01321400
                                        PORT_ACTIVE_EVENT
                                        RC_RNR_NAK_GEN
                                        MEM_WINDOW
                                        XRC
                                        MEM_MGT_EXTENSIONS
                                        MEM_WINDOW_TYPE_2B
        max_sge:                        8
        max_sge_rd:                     0
        max_cq:                         20480
        max_cqe:                        65535
        max_mr:                         32640
        max_pd:                         131072
        max_qp_rd_atom:                 128
        max_ee_rd_atom:                 0
        max_res_rd_atom:                2621184
        max_qp_init_rd_atom:            128
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         32640
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  0
        max_mcast_qp_attach:            0
        max_total_mcast_qp_attach:      0
        max_ah:                         2147483647
        max_fmr:                        255
        max_map_per_fmr:                255
        max_srq:                        16384
        max_srq_wr:                     16383
        max_srq_sge:                    7
        max_pkeys:                      1
        local_ca_ack_delay:             15
        general_odp_caps:
        rc_odp_caps:
                                        NO SUPPORT
        uc_odp_caps:
                                        NO SUPPORT
        ud_odp_caps:
                                        NO SUPPORT
        completion_timestamp_mask not supported
        core clock not supported
        device_cap_flags_ex:            0x0
        tso_caps:
        max_tso:                        0
        rss_caps:
                max_rwq_indirection_tables:                     0
                max_rwq_indirection_table_size:                 0
                rx_hash_function:                               0x0
                rx_hash_fields_mask:                            0x0
        max_wq_type_rq:                 0
        packet_pacing_caps:
                qp_rate_limit_min:      0kbps
                qp_rate_limit_max:      0kbps
        tag matching not supported
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
                        max_msg_sz:             0x80000000
                        port_cap_flags:         0x00010000
                        max_vl_num:             invalid value (0)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           1
                        gid_tbl_len:            16
                        subnet_timeout:         0
                        init_type_reply:        0
                        active_width:           invalid widthX (0)
                        active_speed:           invalid speed (0)
                        phys_state:             DISABLED (3)
                        GID[  0]:               fe80:0000:0000:0000:de21:e2ff:fe30:cfbd
                        GID[  1]:               fe80:0000:0000:0000:de21:e2ff:fe30:cfbd

hca_id: hrn0_3
        transport:                      InfiniBand (0)
        fw_ver:                         3.5.0.3
        node_guid:                      de21:e2ff:fe30:cfbe
        sys_image_guid:                 de21:e2ff:fe30:cfbe
        vendor_id:                      0x19e5
        vendor_part_id:                 6178
        hw_ver:                         0x0
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0x557000
        max_qp:                         20478
        max_qp_wr:                      8191
        device_cap_flags:               0x01321400
                                        PORT_ACTIVE_EVENT
                                        RC_RNR_NAK_GEN
                                        MEM_WINDOW
                                        XRC
                                        MEM_MGT_EXTENSIONS
                                        MEM_WINDOW_TYPE_2B
        max_sge:                        8
        max_sge_rd:                     0
        max_cq:                         20480
        max_cqe:                        65535
        max_mr:                         32640
        max_pd:                         131072
        max_qp_rd_atom:                 128
        max_ee_rd_atom:                 0
        max_res_rd_atom:                2621184
        max_qp_init_rd_atom:            128
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         32640
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  0
        max_mcast_qp_attach:            0
        max_total_mcast_qp_attach:      0
        max_ah:                         2147483647
        max_fmr:                        255
        max_map_per_fmr:                255
        max_srq:                        16384
        max_srq_wr:                     16383
        max_srq_sge:                    7
        max_pkeys:                      1
        local_ca_ack_delay:             15
        general_odp_caps:
        rc_odp_caps:
                                        NO SUPPORT
        uc_odp_caps:
                                        NO SUPPORT
        ud_odp_caps:
                                        NO SUPPORT
        completion_timestamp_mask not supported
        core clock not supported
        device_cap_flags_ex:            0x0
        tso_caps:
        max_tso:                        0
        rss_caps:
                max_rwq_indirection_tables:                     0
                max_rwq_indirection_table_size:                 0
                rx_hash_function:                               0x0
                rx_hash_fields_mask:                            0x0
        max_wq_type_rq:                 0
        packet_pacing_caps:
                qp_rate_limit_min:      0kbps
                qp_rate_limit_max:      0kbps
        tag matching not supported
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
                        max_msg_sz:             0x80000000
                        port_cap_flags:         0x00010000
                        max_vl_num:             invalid value (0)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           1
                        gid_tbl_len:            16
                        subnet_timeout:         0
                        init_type_reply:        0
                        active_width:           invalid widthX (0)
                        active_speed:           invalid speed (0)
                        phys_state:             DISABLED (3)
                        GID[  0]:               fe80:0000:0000:0000:de21:e2ff:fe30:cfbe
                        GID[  1]:               fe80:0000:0000:0000:de21:e2ff:fe30:cfbe
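
In the output above, only hrn0_0 reports `PORT_ACTIVE`; hrn0_1 through hrn0_3 report `PORT_DOWN`, which is consistent with the "there are no active ports" warnings in the log. A minimal sketch for summarizing this from saved `ibv_devinfo -v` text follows. The function names `port_states` and `active_devices` are hypothetical helpers written for this illustration, and the parsing assumes the layout shown in this thread.

```python
import re


def port_states(devinfo_text):
    """Map each hca_id to the list of port states printed under it."""
    states = {}
    current = None
    for line in devinfo_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("hca_id:"):
            # Start of a new device section, e.g. "hca_id: hrn0_0"
            current = stripped.split(":", 1)[1].strip()
            states[current] = []
        elif current is not None:
            # Matches "state: PORT_ACTIVE (4)" but not "phys_state: ...",
            # because re.match anchors at the start of the stripped line.
            m = re.match(r"state:\s+(\w+)", stripped)
            if m:
                states[current].append(m.group(1))
    return states


def active_devices(devinfo_text):
    """Return the hca_ids that have at least one PORT_ACTIVE port."""
    return [dev for dev, st in port_states(devinfo_text).items()
            if "PORT_ACTIVE" in st]
```

Passing the text captured from `ibv_devinfo -v` to `active_devices()` gives the list of devices worth considering when checking which device a provider ends up using.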
shefty commented 3 years ago

It looks like the problem starts here:

libfabric:23352:verbs:eq:vrb_set_rnr_timer():475<warn> Unable to modify QP attribute

That error comes from the libibverbs layer, specifically the hrn0 driver. I don't believe it needs to be fatal. You could try modifying vrb_set_rnr_timer() to ignore any failures.

Untested changes for this are here: #6893. (Note: not even compile-tested.)
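
When several ranks write debug output to the same stream, lines from different PIDs interleave, which makes it hard to see which warning precedes the assert in the failing process. The sketch below groups log lines by PID and lists the `<warn>` records for one PID. It assumes the line layout seen in this thread (`libfabric:<pid>:<provider>:<subsystem>:<function>():<line><level> message`); the field names and the helpers `split_by_pid` and `warnings_for` are hypothetical and not part of libfabric.

```python
import re
from collections import defaultdict

# Layout inferred from the log lines in this thread; other libfabric
# versions may print additional fields, in which case lines are skipped.
LOG_RE = re.compile(
    r"^libfabric:(?P<pid>\d+):(?P<prov>[^:]+):(?P<subsys>[^:]+):"
    r"(?P<func>\w+)\(\):(?P<line>\d+)<(?P<level>\w+)>\s*(?P<msg>.*)$"
)

SAMPLE = [
    "libfabric:79126:ofi_rxm:ep_ctrl:rxm_conn_handle_event():1274<debug> connection successful",
    "libfabric:21098:verbs:eq:vrb_set_rnr_timer():475<warn> Unable to modify QP attribute",
    "libfabric:21098:ofi_rxm:ep_ctrl:rxm_conn_handle_event():1274<debug> connection successful",
]


def split_by_pid(lines):
    """Group parsed log records by PID, keeping their original order."""
    by_pid = defaultdict(list)
    for raw in lines:
        m = LOG_RE.match(raw.strip())
        if m:
            by_pid[int(m.group("pid"))].append(m.groupdict())
    return by_pid


def warnings_for(by_pid, pid):
    """Return (function, message) for each <warn> record of one PID."""
    return [(r["func"], r["msg"])
            for r in by_pid.get(pid, [])
            if r["level"] == "warn"]


if __name__ == "__main__":
    records = split_by_pid(SAMPLE)
    for func, msg in warnings_for(records, 21098):
        print(func, "->", msg)
```

Running it against the full log with the PID reported in the "BAD TERMINATION" banner narrows the output to the warnings of the aborted rank.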

shefty commented 3 years ago

Please retest with the tip of master.

shefty commented 3 years ago

Have you had a chance to retest with the latest main?

shefty commented 2 years ago

The assert is in code that was replaced in v1.14 and main, and this issue has had no activity in over 6 months. Closing. Please open a new issue if there are problems in v1.14 or main (preferably main).