open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Getting Segmentation fault when run OpenFoam with openib on aarch64 #3145

Open rishards opened 7 years ago

rishards commented 7 years ago

I have recently been testing OpenFOAM with the openib BTL on aarch64, and I get a segmentation fault:

__kernel_rt_sigreturn
[node5:45736] *** Process received signal ***
[node5:45736] Signal: Segmentation fault (11)
[node5:45736] Signal code:  (-6)
[node5:45736] Failing at address: 0xb2a8
[node5:45736] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x4000048f1500]
[node5:45736] [ 1] /usr/lib64/libc.so.6(gsignal+0x38)[0x400007ed08f0]
[node5:45736] [ 2] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x4000048f1500]
[node5:45736] *** End of error message ***

The core dump backtrace:

#2  0x0000000000000000 in ?? ()
#3  0x000040003af1c5f8 in btl_openib_handle_incoming (openib_btl=0x799b9d0, ep=0x7e116b0, frag=0x825e030, byte_len=218) at btl_openib_component.c:3103
#4  0x000040003af1ddac in progress_one_device (device=0x7990a90) at btl_openib_component.c:3736
#5  0x000040003af1df38 in btl_openib_component_progress () at btl_openib_component.c:3768
#6  0x0000400038c57d48 in opal_progress () at runtime/opal_progress.c:222
#7  0x00004000384a701c in sync_wait_st (sync=0xffffe52a3048) at ../opal/threads/wait_sync.h:82
#8  0x00004000384a79c0 in ompi_request_default_wait_all (count=12, requests=0x7e2af20, statuses=0x0) at request/req_wait.c:237
#9  0x0000400038527d10 in PMPI_Waitall (count=12, requests=0x7e2af20, statuses=0x0) at pwaitall.c:77
#10 0x00004000376df128 in Foam::UPstream::waitRequests(int) () from /home/OpenFOAM-4.1/platforms/linuxAArch64GccDPInt32Opt/lib/openmpi-system/libPstream.so
#11 0x0000400036e40bd4 in Foam::lduMatrix::updateMatrixInterfaces(Foam::FieldField<Foam::Field, double> const&, Foam::UPtrList<Foam::lduInterfaceField const> const&, Foam::Field<double> const&, Foam::Field<double>&, unsigned char) const () from /home/OpenFOAM-4.1/platforms/linuxAArch64GccDPInt32Opt/lib/libOpenFOAM.so
#12 0x0000400036e4ee28 in Foam::GaussSeidelSmoother::smooth(Foam::word const&, Foam::Field<double>&, Foam::lduMatrix const&, Foam::Field<double> const&, Foam::FieldField<Foam::Field, double> const&, Foam::UPtrList<Foam::lduInterfaceField const> const&, unsigned char, int) () from /home/OpenFOAM-4.1/platforms/linuxAArch64GccDPInt32Opt/lib/libOpenFOAM.so
#13 0x0000400036e6b1d4 in Foam::GAMGSolver::Vcycle(Foam::PtrList<Foam::lduMatrix::smoother> const&, Foam::Field<double>&, Foam::Field<double> const&, Foam::Field<double>&, Foam::Field<double>&, Foam::Field<double>&, Foam::Field<double>&, Foam::Field<double>&, Foam::PtrList<Foam::Field<double> >&, Foam::PtrList<Foam::Field<double> >&, unsigned char) const ()
   from /home/OpenFOAM-4.1/platforms/linuxAArch64GccDPInt32Opt/lib/libOpenFOAM.so
#14 0x0000400036e6caec in Foam::GAMGSolver::solve(Foam::Field<double>&, Foam::Field<double> const&, unsigned char) const ()
   from /home/OpenFOAM-4.1/platforms/linuxAArch64GccDPInt32Opt/lib/libOpenFOAM.so
#15 0x0000400035010180 in Foam::fvMatrix<double>::solveSegregated(Foam::dictionary const&) () from /home/OpenFOAM-4.1/platforms/linuxAArch64GccDPInt32Opt/lib/libfiniteVolume.so
#16 0x000000000046728c in Foam::fvMatrix<double>::solve (this=this@entry=0xffffe52a4b10, solverControls=...) at /home/OpenFOAM-4.1/src/finiteVolume/lnInclude/fvMatrixSolve.C:82
#17 0x0000000000467520 in Foam::fvMatrix<double>::solve (this=this@entry=0xffffe52a4b10) at /home/OpenFOAM-4.1/src/finiteVolume/lnInclude/fvMatrixSolve.C:325
#18 0x0000000000423100 in main (argc=<optimized out>, argv=<optimized out>) at pEqn.H:33

I found that reg->cbfunc is 0x0 at the crash site, so I printed hdr->tag before and after the callback lookup. The value of hdr->tag changes between the two reads:

tag before:0x2 tag after:0x41

        int tag_before = hdr->tag;  /* debug: snapshot the tag before the lookup */
#if OPAL_CUDA_SUPPORT /* CUDA_ASYNC_RECV */
        /* The COPY_ASYNC flag should not be set */
        assert(0 == (des->des_flags & MCA_BTL_DES_FLAGS_CUDA_COPY_ASYNC));
#endif /* OPAL_CUDA_SUPPORT */
        reg = mca_btl_base_active_message_trigger + hdr->tag;

        /* debug: print both reads of the tag when no callback is registered */
        if (reg->cbfunc == 0) {
                printf("tag before:%x\n", tag_before);
                printf("tag after:%x\n", hdr->tag);
        }
        reg->cbfunc(&openib_btl->super, hdr->tag, des, reg->cbdata);
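
The same check can be reduced to a small standalone sketch. This is not the Open MPI code: handler_entry, trigger_table, register_handler, count_handler and dispatch are hypothetical names used only for illustration. The sketch reads the tag once into a local variable and refuses to call a NULL callback, so a tag that points at an unregistered slot is reported as an error return instead of a jump to address 0. It only makes the failure easier to observe; it does not explain why hdr->tag changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the BTL callback table; not the Open MPI types. */
typedef void (*handler_fn)(uint8_t tag, void *cbdata);

typedef struct {
    handler_fn cbfunc;
    void *cbdata;
} handler_entry;

/* One slot per possible 8-bit tag; unregistered slots stay zeroed (NULL). */
static handler_entry trigger_table[256];

/* Register a callback for a tag. */
static void register_handler(uint8_t tag, handler_fn fn, void *cbdata)
{
    assert(fn != NULL);
    trigger_table[tag].cbfunc = fn;
    trigger_table[tag].cbdata = cbdata;
}

/* Example callback: counts invocations through an int passed as cbdata. */
static void count_handler(uint8_t tag, void *cbdata)
{
    (void)tag;
    ++*(int *)cbdata;
}

/*
 * Read the tag exactly once, then use only the local copy for the lookup
 * and the call. Returns 0 on success, -1 if no callback is registered.
 */
static int dispatch(const volatile uint8_t *tag_ptr)
{
    uint8_t tag = *tag_ptr; /* single read of the (possibly changing) tag */
    const handler_entry *reg = &trigger_table[tag];

    if (NULL == reg->cbfunc) {
        return -1; /* unregistered tag: report instead of calling address 0 */
    }
    reg->cbfunc(tag, reg->cbdata);
    return 0;
}
```

With a callback registered only for tag 0x2, dispatch on a tag of 0x2 returns 0 and runs the callback, while dispatch on a tag of 0x41 returns -1 without calling anything.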

Open MPI version: master
OpenFOAM version: 4.1
Example: /tutorials/incompressible/simpleFoam/motorBike/
Command: mpirun -mca btl openib,sm,self -np 6 --hostfile mfile snappyHexMesh/simpleFoam -overwrite -parallel

Has anyone seen this before? Any help would be appreciated!

hppritcha commented 7 years ago

Do you know how your OpenMPI was configured? Could you post the output of ompi_info?

rishards commented 7 years ago

The openib configuration is below. The problem seems to be related to these two issues:

https://github.com/open-mpi/ompi/issues/2067

https://github.com/open-mpi/ompi/issues/2161

I added some debug code in btl_openib.c at line 1766 (mca_btl_openib_sendi) and line 1823 (mca_btl_openib_endpoint_credit_acquire). The return code of mca_btl_openib_endpoint_credit_acquire is sometimes OPAL_ERR_OUT_OF_RESOURCE (-2), and the program then crashes with the segmentation fault.

I also added some debug code in pml_ob1_recvfrag.c at line 745. The program sometimes enters the wrong_seq branch there and then hangs.

I found that the more processes run in parallel, the higher the probability of hitting these problems.

I tried adding some MCA parameters, but it did not help:

mpirun --allow-run-as-root -np ${1} -mca btl_openib_use_message_coalescing 0 -mca btl_openib_max_inline_send 0 -mca btl_openib_max_inline_data 0 -mca btl_openib_use_eager_rdma 0 -mca btl openib,sm,self --hostfile /home/openfoam/nfs_openfoam/motorBike4/hostfile${2} patchSummary -parallel

I also ran the same test on an x86 platform with the same configuration, and the program runs normally. Is there some configuration that would make it work on the aarch64 platform?

Thanks

MCA btl openib: ---------------------------------------------------
          MCA btl openib: parameter "btl_openib_verbose" (current value: "false", data source: default, level: 9 dev/all, type: bool)
                          Output some verbose OpenIB BTL information (0 = no output, nonzero = output)
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes
          MCA btl openib: parameter "btl_openib_warn_no_device_params_found" (current value: "true", data source: default, level: 9 dev/all, type: bool, synonyms: btl_openib_warn_no_hca_params_found)
                          Warn when no device-specific parameters are found in the INI file specified by the btl_openib_device_param_files MCA parameter (0 = do not warn; any other value = warn)
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes
          MCA btl openib: parameter "btl_openib_warn_default_gid_prefix" (current value: "true", data source: default, level: 9 dev/all, type: bool)
                          Warn when there is more than one active ports and at least one of them connected to the network with only default GID prefix configured (0 = do not warn; any other value = warn)
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes
          MCA btl openib: parameter "btl_openib_warn_nonexistent_if" (current value: "true", data source: default, level: 9 dev/all, type: bool)
                          Warn if non-existent devices and/or ports are specified in the btl_openib_if_[in|ex]clude MCA parameters (0 = do not warn; any other value = warn)
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes
          MCA btl openib: parameter "btl_openib_abort_not_enough_reg_mem" (current value: "false", data source: default, level: 9 dev/all, type: bool)
                          If there is not enough registered memory available on the system for Open MPI to function properly, Open MPI will issue a warning.  If this MCA parameter is set to true, then Open MPI will also abort all MPI jobs (0 = warn, but do not abort; any other value = warn and abort)
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes
          MCA btl openib: parameter "btl_openib_poll_cq_batch" (current value: "256", data source: default, level: 9 dev/all, type: unsigned_int)
                          Retrieve up to poll_cq_batch completions from CQ
          MCA btl openib: parameter "btl_openib_device_param_files" (current value: "/usr/local/share/openmpi/mca-btl-openib-device-params.ini", data source: default, level: 9 dev/all, type: string, synonyms: btl_openib_hca_param_files)
                          Colon-delimited list of INI-style files that contain device vendor/part-specific parameters (use semicolon for Windows)
          MCA btl openib: parameter "btl_openib_device_type" (current value: "all", data source: default, level: 9 dev/all, type: int)
                          Specify to only use IB or iWARP network adapters (infiniband = only use InfiniBand HCAs; iwarp = only use iWARP NICs; all = use any available adapters)
                          Valid values: 0:"infiniband", 0:"ib", 1:"iwarp", 1:"iw", 2:"all"
          MCA btl openib: parameter "btl_openib_max_btls" (current value: "-1", data source: default, level: 9 dev/all, type: int)
                          Maximum number of device ports to use (-1 = use all available, otherwise must be >= 1)
          MCA btl openib: parameter "btl_openib_free_list_num" (current value: "8", data source: default, level: 9 dev/all, type: int)
                          Initial size of free lists (must be >= 1)
          MCA btl openib: parameter "btl_openib_free_list_max" (current value: "-1", data source: default, level: 9 dev/all, type: int)
                          Maximum size of free lists (-1 = infinite, otherwise must be >= 0)
          MCA btl openib: parameter "btl_openib_free_list_inc" (current value: "32", data source: default, level: 9 dev/all, type: int)
                          Increment size of free lists (must be >= 1)
          MCA btl openib: parameter "btl_openib_mpool_hints" (current value: "", data source: default, level: 9 dev/all, type: string)
                          hints for selecting a memory pool (default: none)
          MCA btl openib: parameter "btl_openib_rcache" (current value: "grdma", data source: default, level: 9 dev/all, type: string)
                          Name of the registration cache to be used (it is unlikely that you will ever want to change this)
          MCA btl openib: parameter "btl_openib_reg_mru_len" (current value: "16", data source: default, level: 9 dev/all, type: int)
                          Length of the registration cache most recently used list (must be >= 1)
          MCA btl openib: parameter "btl_openib_cq_size" (current value: "8192", data source: default, level: 9 dev/all, type: int, synonyms: btl_openib_ib_cq_size)
                          Minimum size of the OpenFabrics completion queue (CQs are automatically sized based on the number of peer MPI processes; this value determines the *minimum* size of all CQs)
          MCA btl openib: parameter "btl_openib_max_inline_data" (current value: "-1", data source: default, level: 9 dev/all, type: int, synonyms: btl_openib_ib_max_inline_data)
                          Maximum size of inline data segment (-1 = run-time probe to discover max value, otherwise must be >= 0). If not explicitly set, use max_inline_data from the INI file containing device-specific parameters
          MCA btl openib: parameter "btl_openib_pkey" (current value: "0", data source: default, level: 9 dev/all, type: unsigned_int, synonyms: btl_openib_ib_pkey_val)
                          OpenFabrics partition key (pkey) value. Unsigned integer decimal or hex values are allowed (e.g., "3" or "0x3f") and will be masked against the maximum allowable IB partition key value (0x7fff)
          MCA btl openib: parameter "btl_openib_psn" (current value: "0", data source: default, level: 9 dev/all, type: unsigned_int, synonyms: btl_openib_ib_psn)
                          OpenFabrics packet sequence starting number (must be >= 0)
          MCA btl openib: parameter "btl_openib_ib_qp_ous_rd_atom" (current value: "4", data source: default, level: 9 dev/all, type: unsigned_int)
                          InfiniBand outstanding atomic reads (must be >= 0)
          MCA btl openib: parameter "btl_openib_mtu" (current value: "1k", data source: default, level: 9 dev/all, type: int, synonyms: btl_openib_ib_mtu)
                          OpenFabrics MTU, in bytes (if not specified in INI files).  Valid values are: 1=256 bytes, 2=512 bytes, 3=1024 bytes, 4=2048 bytes, 5=4096 bytes
                          Valid values: 1:"256B", 2:"512B", 3:"1k", 4:"2k", 5:"4k"
          MCA btl openib: parameter "btl_openib_ib_min_rnr_timer" (current value: "25", data source: default, level: 9 dev/all, type: unsigned_int)
                          InfiniBand minimum "receiver not ready" timer, in seconds (must be >= 0 and <= 31)
          MCA btl openib: parameter "btl_openib_ib_timeout" (current value: "20", data source: default, level: 9 dev/all, type: unsigned_int)
                          InfiniBand transmit timeout, plugged into formula: 4.096 microseconds * (2^btl_openib_ib_timeout) (must be >= 0 and <= 31)
          MCA btl openib: parameter "btl_openib_ib_retry_count" (current value: "7", data source: default, level: 9 dev/all, type: unsigned_int)
                          InfiniBand transmit retry count (must be >= 0 and <= 7)
          MCA btl openib: parameter "btl_openib_ib_rnr_retry" (current value: "7", data source: default, level: 9 dev/all, type: unsigned_int)
                          InfiniBand "receiver not ready" retry count; applies *only* to SRQ/XRC queues.  PP queues use RNR retry values of 0 because Open MPI performs software flow control to guarantee that RNRs never occur (must be >= 0 and <= 7; 7 = "infinite")
          MCA btl openib: parameter "btl_openib_ib_max_rdma_dst_ops" (current value: "4", data source: default, level: 9 dev/all, type: unsigned_int)
                          InfiniBand maximum pending RDMA destination operations (must be >= 0)
          MCA btl openib: parameter "btl_openib_ib_service_level" (current value: "0", data source: default, level: 9 dev/all, type: unsigned_int)
                          InfiniBand service level (must be >= 0 and <= 15)
          MCA btl openib: parameter "btl_openib_ib_path_record_service_level" (current value: "0", data source: default, level: 9 dev/all, type: unsigned_int)
                          Enable getting InfiniBand service level from PathRecord (must be >= 0, 0 = disabled, positive = try to get the service level from PathRecord)
          MCA btl openib: parameter "btl_openib_use_eager_rdma" (current value: "-1", data source: default, level: 9 dev/all, type: int)
                          Use RDMA for eager messages (-1 = use device default, 0 = do not use eager RDMA, 1 = use eager RDMA)
          MCA btl openib: parameter "btl_openib_eager_rdma_threshold" (current value: "16", data source: default, level: 9 dev/all, type: int)
                          Use RDMA for short messages after this number of messages are received from a given peer (must be >= 1)
          MCA btl openib: parameter "btl_openib_max_eager_rdma" (current value: "16", data source: default, level: 9 dev/all, type: int)
                          Maximum number of peers allowed to use RDMA for short messages (RDMA is used for all long messages, except if explicitly disabled, such as with the "dr" pml) (must be >= 0)
          MCA btl openib: parameter "btl_openib_eager_rdma_num" (current value: "17", data source: default, level: 9 dev/all, type: int)
                          Number of RDMA buffers to allocate for small messages (must be >= 1)
          MCA btl openib: parameter "btl_openib_btls_per_lid" (current value: "1", data source: default, level: 9 dev/all, type: unsigned_int)
                          Number of BTLs to create for each InfiniBand LID (must be >= 1)
          MCA btl openib: parameter "btl_openib_max_lmc" (current value: "1", data source: default, level: 9 dev/all, type: unsigned_int)
                          Maximum number of LIDs to use for each device port (must be >= 0, where 0 = use all available)
          MCA btl openib: parameter "btl_openib_enable_apm_over_lmc" (current value: "0", data source: default, level: 9 dev/all, type: int)
                          Maximum number of alternative paths for each device port (must be >= -1, where 0 = disable apm, -1 = all available alternative paths )
          MCA btl openib: parameter "btl_openib_enable_apm_over_ports" (current value: "0", data source: default, level: 9 dev/all, type: int)
                          Enable alternative path migration (APM) over different ports of the same device (must be >= 0, where 0 = disable APM over ports, 1 = enable APM over ports of the same device)
          MCA btl openib: parameter "btl_openib_use_async_event_thread" (current value: "true", data source: default, level: 9 dev/all, type: bool)
                          If nonzero, use the thread that will handle InfiniBand asynchronous events
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes
          MCA btl openib: parameter "btl_openib_enable_srq_resize" (current value: "true", data source: default, level: 9 dev/all, type: bool)
                          Enable/Disable on demand SRQ resize. (0 = without resizing, nonzero = with resizing)
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes
          MCA btl openib: parameter "btl_openib_rroce_enable" (current value: "false", data source: default, level: 9 dev/all, type: bool)
                          Enable/Disable routing between different subnets(0 = disable, nonzero = enable)
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes
          MCA btl openib: parameter "btl_openib_buffer_alignment" (current value: "64", data source: default, level: 9 dev/all, type: unsigned_int)
                          Preferred communication buffer alignment, in bytes (must be > 0 and power of two)
          MCA btl openib: parameter "btl_openib_use_message_coalescing" (current value: "false", data source: default, level: 9 dev/all, type: bool)
                          If nonzero, use message coalescing
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes
          MCA btl openib: parameter "btl_openib_cq_poll_ratio" (current value: "100", data source: default, level: 9 dev/all, type: unsigned_int)
                          How often to poll high priority CQ versus low priority CQ
          MCA btl openib: parameter "btl_openib_eager_rdma_poll_ratio" (current value: "100", data source: default, level: 9 dev/all, type: unsigned_int)
                          How often to poll eager RDMA channel versus CQ
          MCA btl openib: parameter "btl_openib_hp_cq_poll_per_progress" (current value: "10", data source: default, level: 9 dev/all, type: unsigned_int)
                          Max number of completion events to process for each call of BTL progress engine
          MCA btl openib: parameter "btl_openib_max_hw_msg_size" (current value: "0", data source: default, level: 9 dev/all, type: unsigned_int)
                          Maximum size (in bytes) of a single fragment of a long message when using the RDMA protocols (must be > 0 and <= hw capabilities).
          MCA btl openib: parameter "btl_openib_allow_max_memory_registration" (current value: "true", data source: default, level: 9 dev/all, type: bool)
                          Allow maximum possible memory to register with HCA
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes
          MCA btl openib: parameter "btl_openib_memory_registration_verbose" (current value: "0", data source: default, level: 9 dev/all, type: int)
                          Output some verbose memory registration information (0 = no output, nonzero = output)
          MCA btl openib: parameter "btl_openib_ignore_locality" (current value: "0", data source: default, level: 9 dev/all, type: int)
                          Ignore any locality information and use all devices (0 = use locality informaiton and use only close devices, nonzero = ignore locality information)
          MCA btl openib: informational "btl_openib_have_fork_support" (current value: "true", data source: default, level: 9 dev/all, type: bool)
                          Whether the OpenFabrics stack supports applications that invoke the "fork()" system call or not (0 = no, 1 = yes). Note that this value does NOT indicate whether the system being run on supports "fork()" with OpenFabrics applications or not.
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes
          MCA btl openib: parameter "btl_openib_exclusivity" (current value: "1024", data source: default, level: 7 dev/basic, type: unsigned_int)
                          BTL exclusivity (must be >= 0)
          MCA btl openib: parameter "btl_openib_flags" (current value: "send,put,get,fetching-atomics,need-ack,need-csum,hetero-rdma", data source: default, level: 5 tuner/detail, type: unsigned_int)
                          BTL bit flags (general flags: send, put, get, in-place, hetero-rdma, atomics, fetching-atomics)
                          Valid values: Comma-delimited list of:  0x1:"send", 0x2:"put", 0x4:"get", 0x8:"inplace", 0x4000:"signaled", 0x8000:"atomics", 0x10000:"fetching-atomics", 0x20000:"static", 0x400:"cuda-put", 0x800:"cuda-get", 0x1000:"cuda-async-send", 0x2000:"cuda-async-recv", 0x200:"failover", 0x10:"need-ack", 0x20:"need-csum", 0x100:"hetero-rdma"
          MCA btl openib: informational "btl_openib_atomic_flags" (current value: "add,compare-and-swap", data source: default, level: 5 tuner/detail, type: unsigned_int)
                          BTL atomic support flags
                          Valid values: Comma-delimited list of:  0x1:"add", 0x200:"and", 0x400:"or", 0x800:"xor", 0x1000:"land", 0x2000:"lor", 0x4000:"lxor", 0x10000:"swap", 0x100000:"min", 0x200000:"max", 0x10000000:"compare-and-swap", 0x20000000:"global"
          MCA btl openib: parameter "btl_openib_rndv_eager_limit" (current value: "12288", data source: default, level: 4 tuner/basic, type: size_t)
                          Size (in bytes, including header) of "phase 1" fragment sent for all large messages (must be >= 0 and <= eager_limit)
          MCA btl openib: parameter "btl_openib_eager_limit" (current value: "12288", data source: default, level: 4 tuner/basic, type: size_t)
                          Maximum size (in bytes, including header) of "short" messages (must be >= 1).
          MCA btl openib: parameter "btl_openib_get_limit" (current value: "18446744073709551615", data source: default, level: 4 tuner/basic, type: size_t)
                          Maximum size (in bytes) for btl get
          MCA btl openib: parameter "btl_openib_get_alignment" (current value: "0", data source: default, level: 6 tuner/all, type: size_t)
                          Alignment required for btl get
          MCA btl openib: parameter "btl_openib_put_limit" (current value: "18446744073709551615", data source: default, level: 4 tuner/basic, type: size_t)
                          Maximum size (in bytes) for btl put
          MCA btl openib: parameter "btl_openib_put_alignment" (current value: "0", data source: default, level: 6 tuner/all, type: size_t)
                          Alignment required for btl put
          MCA btl openib: parameter "btl_openib_max_send_size" (current value: "65536", data source: default, level: 4 tuner/basic, type: size_t)
                          Maximum size (in bytes) of a single "phase 2" fragment of a long message when using the pipeline protocol (must be >= 1)
          MCA btl openib: parameter "btl_openib_rdma_pipeline_send_length" (current value: "1048576", data source: default, level: 4 tuner/basic, type: size_t)
                          Length of the "phase 2" portion of a large message (in bytes) when using the pipeline protocol.  This part of the message will be split into fragments of size max_send_size and sent using send/receive semantics (must be >= 0; only relevant when the PUT flag is set)
          MCA btl openib: parameter "btl_openib_rdma_pipeline_frag_size" (current value: "1048576", data source: default, level: 4 tuner/basic, type: size_t)
                          Maximum size (in bytes) of a single "phase 3" fragment from a long message when using the pipeline protocol.  These fragments will be sent using RDMA semantics (must be >= 1; only relevant when the PUT flag is set)
          MCA btl openib: parameter "btl_openib_min_rdma_pipeline_size" (current value: "1060864", data source: default, level: 4 tuner/basic, type: size_t)
                          Messages smaller than this size (in bytes) will not use the RDMA pipeline protocol.  Instead, they will be split into fragments of max_send_size and sent using send/receive semantics (must be >=0, and is automatically adjusted up to at least (eager_limit+btl_rdma_pipeline_send_length); only relevant when the PUT flag is set)
          MCA btl openib: parameter "btl_openib_latency" (current value: "4", data source: default, level: 5 tuner/detail, type: unsigned_int)
                          Approximate latency of interconnect (0 = auto-detect value at run-time [not supported in all BTL modules], >= 1 = latency in microseconds)
          MCA btl openib: parameter "btl_openib_bandwidth" (current value: "0", data source: default, level: 5 tuner/detail, type: unsigned_int)
                          Approximate maximum bandwidth of interconnect (0 = auto-detect value at run-time [not supported in all BTL modules], >= 1 = bandwidth in Mbps)
          MCA btl openib: parameter "btl_openib_receive_queues" (current value: "S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64", data source: default, level: 9 dev/all, type: string)
                          Colon-delimited, comma-delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
          MCA btl openib: parameter "btl_openib_if_include" (current value: "", data source: default, level: 9 dev/all, type: string)
                          Comma-delimited list of devices/ports to be used (e.g. "mthca0,mthca1:2"; empty value means to use all ports found).  Mutually exclusive with btl_openib_if_exclude.
          MCA btl openib: parameter "btl_openib_if_exclude" (current value: "", data source: default, level: 9 dev/all, type: string)
                          Comma-delimited list of device/ports to be excluded (empty value means to not exclude any ports).  Mutually exclusive with btl_openib_if_include.
          MCA btl openib: parameter "btl_openib_ipaddr_include" (current value: "", data source: default, level: 9 dev/all, type: string)
                          Comma-delimited list of IP Addresses to be used (e.g. "192.168.1.0/24").  Mutually exclusive with btl_openib_ipaddr_exclude.
          MCA btl openib: parameter "btl_openib_ipaddr_exclude" (current value: "", data source: default, level: 9 dev/all, type: string)
                          Comma-delimited list of IP Addresses to be excluded (e.g. "192.168.1.0/24").  Mutually exclusive with btl_openib_ipaddr_include.
          MCA btl openib: parameter "btl_openib_gid_index" (current value: "0", data source: default, level: 9 dev/all, type: int)
                          GID index to use on verbs device ports
          MCA btl openib: parameter "btl_openib_allow_different_subnets" (current value: "false", data source: default, level: 9 dev/all, type: bool)
                          Allow connecting processes from different IB subnets.(0 = do not allow; 1 = allow)
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes
          MCA btl openib: parameter "btl_openib_cpc_include" (current value: "", data source: default, level: 9 dev/all, type: string)
                          Method used to select OpenFabrics connections (valid values: rdmacm,udcm)
          MCA btl openib: parameter "btl_openib_cpc_exclude" (current value: "", data source: default, level: 9 dev/all, type: string)
                          Method used to exclude OpenFabrics connections (valid values: rdmacm,udcm)
          MCA btl openib: parameter "btl_openib_connect_rdmacm_priority" (current value: "30", data source: default, level: 9 dev/all, type: int)
                          The selection method priority for rdma_cm
          MCA btl openib: parameter "btl_openib_connect_rdmacm_port" (current value: "0", data source: default, level: 9 dev/all, type: unsigned_int)
                          The selection method port for rdma_cm
          MCA btl openib: parameter "btl_openib_connect_rdmacm_resolve_timeout" (current value: "30000", data source: default, level: 9 dev/all, type: int)
                          The timeout (in miliseconds) for address and route resolution
          MCA btl openib: parameter "btl_openib_connect_rdmacm_retry_count" (current value: "20", data source: default, level: 9 dev/all, type: int)
                          Maximum number of times rdmacm will retry route resolution
          MCA btl openib: parameter "btl_openib_connect_rdmacm_reject_causes_connect_error" (current value: "false", data source: default, level: 9 dev/all, type: bool)
                          The drivers for some devices are buggy such that an RDMA REJECT action may result in a CONNECT_ERROR event instead of a REJECTED event.  Setting this MCA parameter to true tells Open MPI to treat CONNECT_ERROR events on connections where a REJECT is expected as a REJECT (default: false)
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes
          MCA btl openib: parameter "btl_openib_connect_udcm_priority" (current value: "63", data source: default, level: 9 dev/all, type: int)
                          Priority of the udcm connection method
          MCA btl openib: parameter "btl_openib_connect_udcm_recv_count" (current value: "512", data source: default, level: 9 dev/all, type: int)
                          Number of registered buffers to post
          MCA btl openib: parameter "btl_openib_connect_udcm_timeout" (current value: "500000", data source: default, level: 9 dev/all, type: int)
                          Ack timeout for udcm connection messages
          MCA btl openib: parameter "btl_openib_connect_udcm_max_retry" (current value: "25", data source: default, level: 9 dev/all, type: int)
                          Maximum number of times to retry sending a udcm connection message
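
For reference, the btl_openib_ib_timeout entry in the listing above describes its value as an exponent in the formula 4.096 microseconds * (2^btl_openib_ib_timeout). The sketch below only evaluates that formula; the function name ib_timeout_seconds is hypothetical and is not part of Open MPI.

```c
/*
 * Evaluate 4.096 microseconds * 2^exponent, in seconds.
 * The listing restricts the exponent to 0..31; anything larger
 * returns -1.0 to signal an invalid setting.
 */
static double ib_timeout_seconds(unsigned int exponent)
{
    if (exponent > 31) {
        return -1.0;
    }
    return 4.096e-6 * (double)(1ULL << exponent);
}
```

With the default value of 20 shown in the listing, the formula gives 4.096e-6 * 1048576, which is about 4.29 seconds.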

rishards commented 7 years ago

@rhc54 @hppritcha @jsquyres Can someone please take a look ?

jsquyres commented 7 years ago

@shamisp ARM: Can you have a look at this?