ornladios / ADIOS2

Next generation of ADIOS developed in the Exascale Computing Program
https://adios2.readthedocs.io/en/latest/index.html
Apache License 2.0

Installation of libfabric for RDMA support in ADIOS2: Probing required features with fi_pingpong #2674

Open franzpoeschel opened 3 years ago

franzpoeschel commented 3 years ago

This is a question on the configuration of libfabric for RDMA-based staging in SST.

I am currently experiencing issues with the libfabric-based RDMA backend of SST on our internal cluster Hemera at HZDR. To report this to our admins, I want to provide a minimal example based on fi_pingpong.

The crash that I see on the reading side, when enabling verbose SST logging:

Sst set to use sockets as a Control Transport
RDMA Dataplane sees interface mlx4_0, provider type verbs;ofi_rxm, which should work.
RDMA Dataplane sees interface mlx4_0, provider type verbs;ofi_rxm, which should work.
RDMA Dataplane sees interface mlx4_0, provider type verbs;ofi_rxm, which should work.
RDMA Dataplane evaluating viability, returning priority 10
Considering DataPlane "evpath" for possible use, priority is 1
Considering DataPlane "rdma" for possible use, priority is 10
Selecting DataPlane "rdma", priority 10 for use
Looking for writer contact in file ./electrons.sst, with timeout 60 secs
seeing candidate fabric verbs;ofi_rxm, will use this unless we see something better.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to enp129s0f0, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to enp129s0f0:1, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to enp129s0f0, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to lo, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to lo, but it may not be stable or performant.
ignoring fabric verbs;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to mlx4_0, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to enp129s0f0, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to enp129s0f0:1, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to enp129s0f0, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to lo, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to lo, but it may not be stable or performant.
ignoring fabric verbs;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to mlx4_0, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to enp129s0f0, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to enp129s0f0:1, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to enp129s0f0, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to lo, but it may not be stable or performant.
ignoring fabric tcp;ofi_rxm because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to lo, but it may not be stable or performant.
ignoring fabric sockets because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to enp129s0f0, but it may not be stable or performant.
ignoring fabric sockets because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to enp129s0f0:1, but it may not be stable or performant.
ignoring fabric sockets because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to enp129s0f0, but it may not be stable or performant.
ignoring fabric sockets because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to lo, but it may not be stable or performant.
ignoring fabric sockets because it's not of a supported type. It may work to force this fabric to be used by setting FABRIC_IFACE to lo, but it may not be stable or performant.
Fabric parameters to use at fabric initialization: fi_info:
    caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM ]
    mode: [ FI_LOCAL_MR ]
    addr_format: FI_SOCKADDR_IB
    src_addrlen: 48
    dest_addrlen: 0
    src_addr: fi_sockaddr_ib://[fe80::2:c903:1b:bf21]:0xffff:0x13f:0x0
    dest_addr: (null)
    handle: (nil)
    fi_tx_attr:
        caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_SEND ]
        mode: [ FI_LOCAL_MR ]
        op_flags: [  ]
        msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAW ]
        comp_order: [ FI_ORDER_NONE ]
        inject_size: 16320
        size: 1024
        iov_limit: 4
        rma_iov_limit: 1
    fi_rx_attr:
        caps: [ FI_MSG, FI_RMA, FI_RECV, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV ]
        mode: [  ]
        op_flags: [  ]
        msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAW ]
        comp_order: [ FI_ORDER_NONE ]
        total_buffered_recv: 0
        size: 1024
        iov_limit: 4
    fi_ep_attr:
        type: FI_EP_RDM
        protocol: FI_PROTO_RXM
        protocol_version: 1
        max_msg_size: 1073741824
        msg_prefix_size: 0
        max_order_raw_size: 1073741824
        max_order_war_size: 0
        max_order_waw_size: 1073741824
        mem_tag_format: 0xaaaaaaaaaaaaaaaa
        tx_ctx_cnt: 1
        rx_ctx_cnt: 1
        auth_key_size: 0
    fi_domain_attr:
        domain: 0x0
        name: mlx4_0
        threading: FI_THREAD_SAFE
        control_progress: FI_PROGRESS_AUTO
        data_progress: FI_PROGRESS_AUTO
        resource_mgmt: FI_RM_ENABLED
        av_type: FI_AV_UNSPEC
        mr_mode: [ FI_MR_BASIC ]
        mr_key_size: 4
        cq_data_size: 4
        cq_cnt: 65536
        ep_cnt: 32768
        tx_ctx_cnt: 1
        rx_ctx_cnt: 1
        max_ep_tx_ctx: 1
        max_ep_rx_ctx: 1
        max_ep_stx_ctx: 0
        max_ep_srx_ctx: 0
        cntr_cnt: 0
        mr_iov_limit: 1
    caps: [ FI_LOCAL_COMM, FI_REMOTE_COMM ]
    mode: [  ]
        auth_key_size: 0
        max_err_data: 0
        mr_cnt: 0
    fi_fabric_attr:
        name: IB-0xfe80000000000000
        prov_name: verbs;ofi_rxm
        prov_version: 111.10
        api_version: 1.5
    fid_nic:
        fi_device_attr:
            name: mlx4_0
            device_id: 0x1003
            device_version: 1
            vendor_id: 0x02c9
            driver: (null)
            firmware: 2.30.8000
        fi_bus_attr:
            fi_bus_type: FI_BUS_UNKNOWN
        fi_link_attr:
            address: (null)
            mtu: 4096
            speed: 56000000000
            state: FI_LINK_UP
            network_type: InfiniBand

accessing domain failed with -61 (Unknown error -61). This is fatal.

Backtrace of the error:

 0  /trinity/shared/pkg/mpi/ucx/1.10.0/gcc/7.3.0/lib/libucs.so.0(ucs_handle_error+0xe4) [0x2aaada412504]
 1  /trinity/shared/pkg/mpi/ucx/1.10.0/gcc/7.3.0/lib/libucs.so.0(+0x2782c) [0x2aaada41282c]
 2  /trinity/shared/pkg/mpi/ucx/1.10.0/gcc/7.3.0/lib/libucs.so.0(+0x27a94) [0x2aaada412a94]
 3  /trinity/shared/pkg/filelib/adios/2.7.1-cuda112/gcc/7.3.0/openmpi/4.0.4-cuda112/lib64/libadios2_core.so.2(+0x5100b7) [0x2aaab4fd90b7]
 4  /trinity/shared/pkg/filelib/adios/2.7.1-cuda112/gcc/7.3.0/openmpi/4.0.4-cuda112/lib64/libadios2_core.so.2(SstReaderOpen+0x124) [0x2aaab4fc3334]
 5  /trinity/shared/pkg/filelib/adios/2.7.1-cuda112/gcc/7.3.0/openmpi/4.0.4-cuda112/lib64/libadios2_core.so.2(adios2::core::engine::SstReader::SstReader(adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm)+0xd0) [0x2aaab4f01e10]
 6  /trinity/shared/pkg/filelib/adios/2.7.1-cuda112/gcc/7.3.0/openmpi/4.0.4-cuda112/lib64/libadios2_core.so.2(std::shared_ptr<adios2::core::Engine> adios2::core::IO::MakeEngine<adios2::core::engine::SstReader>(adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm)+0x72) [0x2aaab4c40922]
 7  /trinity/shared/pkg/filelib/adios/2.7.1-cuda112/gcc/7.3.0/openmpi/4.0.4-cuda112/lib64/libadios2_core_mpi.so.2(std::_Function_handler<std::shared_ptr<adios2::core::Engine> (adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm), std::shared_ptr<adios2::core::Engine> (*)(adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm)>::_M_invoke(std::_Any_data const&, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode&&, adios2::helper::Comm&&)+0x3f) [0x2aaab47ea3af]
 8  /trinity/shared/pkg/filelib/adios/2.7.1-cuda112/gcc/7.3.0/openmpi/4.0.4-cuda112/lib64/libadios2_core.so.2(adios2::core::IO::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm)+0x5a6) [0x2aaab4c3e5b6]
 9  /trinity/shared/pkg/filelib/adios/2.7.1-cuda112/gcc/7.3.0/openmpi/4.0.4-cuda112/lib64/libadios2_core.so.2(adios2::core::IO::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode)+0x59) [0x2aaab4c3eda9]
10  /trinity/shared/pkg/filelib/adios/2.7.1-cuda112/gcc/7.3.0/openmpi/4.0.4-cuda112/lib64/libadios2_cxx11.so.2(adios2::IO::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode)+0x12d) [0x2aaab345e48d]

Selecting "DataTransport" = "WAN" manually results in a working (but not particularly performant) staging setup.

From this, I gather that ADIOS2 uses the verbs fabric and the rdm endpoint to establish a connection. From what limited knowledge I have of libfabric, I tried to replicate the situation using fi_pingpong:

> fi_pingpong -p verbs -e rdm & sleep 1; fi_pingpong localhost -p verbs -e rdm
[1] 27854
fi_domain(): util/pingpong.c:1366, ret=-61 (No data available)
[error] util/pingpong.c:521 : ctrl/read: no data or remote connection closed
[1]+  Exit 61                 fi_pingpong -p verbs -e rdm

This results in the same error code -61 on the reader's end.

Questions:

  1. Is it possible/sensible to use fi_pingpong to check libfabric for RDMA support in ADIOS2/SST? If not, are there alternatives?
  2. Is the above configuration mimicking correctly what ADIOS2 tries to do? If not, how can I achieve this?

Additional context

For comparison, I checked on Summit. The installed libfabric and ADIOS2 modules there (libfabric/1.7.0 and adios2/2.7.0) apparently do not support RDMA-based staging either:

Sst set to use sockets as a Control Transport
RDMA Dataplane could not find an RDMA-compatible fabric.
RDMA Dataplane evaluating viability, returning priority -1
Prefered dataplane name is "rdma"
Considering DataPlane "evpath" for possible use, priority is 1
Considering DataPlane "rdma" for possible use, priority is -1
Warning:  Perferred DataPlane "rdma" is not available.
Selecting DataPlane "evpath", priority 1 for use
RDMA Dataplane unloading

The above instance of fi_pingpong results in the exact same error.

Also, this might help with issue #2601, in case the admins there ever get libfabric installed.

philip-davis commented 3 years ago

Hello,

fi_pingpong with those arguments is a good approximation, but SST puts a few more constraints on libfabric than fi_pingpong does. For a failure at init like the one you are seeing, this shouldn't make a difference, but it might mean that fi_pingpong discovers some fabrics that SST wouldn't. Below is some code that should reproduce the init sequence (but not actual data transfer).

I'm not positive, but you might be seeing this libfabric issue. If so, I think 1.7.2 has a fix for the 1.7.x branch. You might try disabling the MR cache as well:

export FI_MR_CACHE_MAX_COUNT=0
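If exporting the variable in the job script is inconvenient, one possible alternative is to set it from inside the test program before the first libfabric call. This is only a sketch: it assumes that libfabric reads FI_MR_CACHE_MAX_COUNT from the process environment when it initializes, and the helper name disable_fi_mr_cache is made up for illustration.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: sets FI_MR_CACHE_MAX_COUNT=0 for the current process.
 * Call it before fi_getinfo() or any other libfabric function (assumption:
 * libfabric picks the value up from the environment at initialization).
 * Returns 0 on success, -1 on failure (as setenv() does).
 */
static int disable_fi_mr_cache(void)
{
    return setenv("FI_MR_CACHE_MAX_COUNT", "0", 1 /* overwrite */);
}
```

In the sample below, such a call would go at the top of main(), before init_fabric().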

Other than that, I am concerned about the address of the IB adapter being IPv6 link-local, since there are some libfabric issues with that, but I'm not sure what a good way to address that is at this point.

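#include <pthread.h> /* pthread_t in struct fabric_state */
#include <stdint.h>  /* uint8_t / uint32_t in the DRC code paths */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <mpi.h>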

#include <rdma/fabric.h>
#include <rdma/fi_cm.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_rma.h>

#ifdef SST_HAVE_FI_GNI
#include <rdma/fi_ext_gni.h>
#ifdef SST_HAVE_CRAY_DRC
#include <rdmacred.h>

#define DP_DRC_MAX_TRY 60
#define DP_DRC_WAIT_USEC 1000000

#endif /* SST_HAVE_CRAY_DRC */
#endif /* SST_HAVE_FI_GNI */

#define DP_AV_DEF_SIZE 512

struct fabric_state
{
    struct fi_context *ctx;
    struct fi_info *info;
    int local_mr_req;
    int rx_cq_data;
    size_t addr_len;
    size_t msg_prefix_size;
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fid_ep *signal;
    struct fid_cq *cq_signal;
    struct fid_av *av;
    pthread_t listener;
#ifdef SST_HAVE_CRAY_DRC
    drc_info_handle_t drc_info;
    uint32_t credential;
    struct fi_gni_auth_key *auth_key;
#endif /* SST_HAVE_CRAY_DRC */
};

static void init_fabric(struct fabric_state *fabric, const char *ifname)
{
    struct fi_info *hints, *info, *originfo, *useinfo;
    struct fi_av_attr av_attr = {0};
    struct fi_cq_attr cq_attr = {0};
    int result;

    hints = fi_allocinfo();
    hints->caps = FI_MSG | FI_SEND | FI_RECV | FI_REMOTE_READ |
                  FI_REMOTE_WRITE | FI_RMA | FI_READ | FI_WRITE;
    hints->mode = FI_CONTEXT | FI_LOCAL_MR | FI_CONTEXT2 | FI_MSG_PREFIX |
                  FI_ASYNC_IOV | FI_RX_CQ_DATA;
    hints->domain_attr->mr_mode = FI_MR_BASIC;
    hints->domain_attr->control_progress = FI_PROGRESS_AUTO;
    hints->domain_attr->data_progress = FI_PROGRESS_AUTO;
    hints->ep_attr->type = FI_EP_RDM;

    fabric->info = NULL;

    fprintf(stderr, "INFO: initializing fabric...\n");

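    info = NULL; /* so the check below is well defined if fi_getinfo() fails */
    fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info);
    /* hints are no longer needed; free them before any early return */
    fi_freeinfo(hints);
    if (!info)
    {
        fprintf(stderr, "ERROR: no fabrics detected.\n");
        fabric->info = NULL;
        return;
    }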

    originfo = info;
    useinfo = NULL;
    while (info)
    {
        char *prov_name = info->fabric_attr->prov_name;
        char *domain_name = info->domain_attr->name;

        if (ifname && strcmp(ifname, domain_name) == 0)
        {
            fprintf(stderr, "INFO: using specified interface '%s'.\n", ifname);
            useinfo = info;
            break;
        }
        if ((((strcmp(prov_name, "verbs") == 0) && info->src_addr) ||
             (strcmp(prov_name, "gni") == 0) ||
             (strcmp(prov_name, "psm2") == 0)) &&
            (!useinfo || !ifname ||
             (strcmp(useinfo->domain_attr->name, ifname) != 0)))
        {
            fprintf(stderr, "INFO: seeing candidate fabric %s, will use this unless we see something better.\n", prov_name);
            useinfo = info;
        }
        else if (((strstr(prov_name, "verbs") && info->src_addr) ||
                  strstr(prov_name, "gni") || strstr(prov_name, "psm2")) &&
                 !useinfo)
        {
            fprintf(stderr, "INFO: seeing candidate fabric %s, will use this unless we see something better.\n", prov_name);
            useinfo = info;
        }
        else
        {
            fprintf(stderr, "ignoring fabric %s because it's not of a supported type.\n", prov_name);
        }
        info = info->next;
    }

    info = useinfo;

    if (!info)
    {
        fprintf(stderr, "ERROR: "
            "none of the usable system fabrics are supported high speed "
            "interfaces (verbs, gni, psm2.) To use a compatible fabric that is "
            "being ignored (probably sockets), set the environment variable "
            "FABRIC_IFACE to the interface name. Check the output of fi_info "
            "to troubleshoot this message.\n");
        fabric->info = NULL;
        return;
    }

    if (info->mode & FI_CONTEXT2)
    {
        fabric->ctx = calloc(2, sizeof(*fabric->ctx));
    }
    else if (info->mode & FI_CONTEXT)
    {
        fabric->ctx = calloc(1, sizeof(*fabric->ctx));
    }
    else
    {
        fabric->ctx = NULL;
    }

    if (info->mode & FI_LOCAL_MR)
    {
        fabric->local_mr_req = 1;
    }
    else
    {
        fabric->local_mr_req = 0;
    }

    if (info->mode & FI_MSG_PREFIX)
    {
        fabric->msg_prefix_size = info->ep_attr->msg_prefix_size;
    }
    else
    {
        fabric->msg_prefix_size = 0;
    }

    if (info->mode & FI_RX_CQ_DATA)
    {
        fabric->rx_cq_data = 1;
    }
    else
    {
        fabric->rx_cq_data = 0;
    }

    fabric->addr_len = info->src_addrlen;

    info->domain_attr->mr_mode = FI_MR_BASIC;
#ifdef SST_HAVE_CRAY_DRC
    if (strstr(info->fabric_attr->prov_name, "gni") && fabric->auth_key)
    {
        info->domain_attr->auth_key = (uint8_t *)fabric->auth_key;
        info->domain_attr->auth_key_size = sizeof(struct fi_gni_raw_auth_key);
    }
#endif /* SST_HAVE_CRAY_DRC */
    fabric->info = fi_dupinfo(info);
    if (!fabric->info)
    {
        fprintf(stderr,
                      "ERROR: copying the fabric info failed.\n");
        return;
    }

        fprintf(stderr,
        "INFO: fabric parameters to use at fabric initialization: %s\n",
                  fi_tostr(fabric->info, FI_TYPE_INFO));

    result = fi_fabric(info->fabric_attr, &fabric->fabric, fabric->ctx);
    if (result != FI_SUCCESS)
    {
        fprintf(stderr,
            "ERROR: opening fabric access failed with %d (%s). This is fatal.\n",
            result, fi_strerror(result));
        return;
    }
    result = fi_domain(fabric->fabric, info, &fabric->domain, fabric->ctx);
    if (result != FI_SUCCESS)
    {
       fprintf(stderr,
            "ERROR: accessing domain failed with %d (%s). This is fatal.\n",
                      result, fi_strerror(result));
        return;
    }
    info->ep_attr->type = FI_EP_RDM;
    result = fi_endpoint(fabric->domain, info, &fabric->signal, fabric->ctx);
    if (result != FI_SUCCESS || !fabric->signal)
    {
        fprintf(stderr,
            "ERROR: opening endpoint failed with %d (%s). This is fatal.\n",
                      result, fi_strerror(result));
        return;
    }

    av_attr.type = FI_AV_MAP;
    av_attr.count = DP_AV_DEF_SIZE;
    av_attr.ep_per_node = 0;
    result = fi_av_open(fabric->domain, &av_attr, &fabric->av, fabric->ctx);
    if (result != FI_SUCCESS)
    {
        fprintf(stderr,
                      "ERROR: could not initialize address vector, failed with %d "
                      "(%s). This is fatal.\n",
                      result, fi_strerror(result));
        return;
    }
    result = fi_ep_bind(fabric->signal, &fabric->av->fid, 0);
    if (result != FI_SUCCESS)
    {
        fprintf(stderr,
            "ERROR: could not bind endpoint to address vector, failed with "
                      "%d (%s). This is fatal.\n",
                      result, fi_strerror(result));
        return;
    }

    cq_attr.size = 0;
    cq_attr.format = FI_CQ_FORMAT_DATA;
    cq_attr.wait_obj = FI_WAIT_UNSPEC;
    cq_attr.wait_cond = FI_CQ_COND_NONE;
    result =
        fi_cq_open(fabric->domain, &cq_attr, &fabric->cq_signal, fabric->ctx);
    if (result != FI_SUCCESS)
    {
        fprintf(stderr,
            "ERROR: opening completion queue failed with %d (%s). This is fatal.\n",
            result, fi_strerror(result));
        return;
    }

    result = fi_ep_bind(fabric->signal, &fabric->cq_signal->fid,
                        FI_TRANSMIT | FI_RECV);
    if (result != FI_SUCCESS)
    {
       fprintf(stderr,
            "ERROR: could not bind endpoint to completion queue, failed "
                      "with %d (%s). This is fatal.\n",
                      result, fi_strerror(result));
        return;
    }

    result = fi_enable(fabric->signal);
    if (result != FI_SUCCESS)
    {
        fprintf(stderr,
                      "ERROR: enable endpoint, failed with %d (%s). This is fatal.\n",
                      result, fi_strerror(result));
        return;
    }

    fprintf(stderr, "INFO: fabric successfully initialized.\n");

    fi_freeinfo(originfo);
}

static void fini_fabric(struct fabric_state *fabric)
{

    int res;

    fprintf(stderr, "INFO: finalizing fabric...\n");

    do
    {
        res = fi_close((struct fid *)fabric->signal);
    } while (res == -FI_EBUSY);

    if (res != FI_SUCCESS)
    {
        fprintf(stderr,
                      "ERROR: could not close ep, failed with %d (%s).\n", res,
                      fi_strerror(res));
        return;
    }

    res = fi_close((struct fid *)fabric->cq_signal);
    if (res != FI_SUCCESS)
    {
        fprintf(stderr,
                      "ERROR: could not close cq, failed with %d (%s).\n", res,
                      fi_strerror(res));
    }

    res = fi_close((struct fid *)fabric->av);
    if (res != FI_SUCCESS)
    {
        fprintf(stderr,
                      "ERROR: could not close av, failed with %d (%s).\n", res,
                      fi_strerror(res));
    }
    res = fi_close((struct fid *)fabric->domain);
    if (res != FI_SUCCESS)
    {
        fprintf(stderr,
                      "ERROR: could not close domain, failed with %d (%s).\n", res,
                      fi_strerror(res));
        return;
    }

    res = fi_close((struct fid *)fabric->fabric);
    if (res != FI_SUCCESS)
    {
        fprintf(stderr,
                      "ERROR: could not close fabric, failed with %d (%s).\n", res,
                      fi_strerror(res));
        return;
    }

    fi_freeinfo(fabric->info);

    if (fabric->ctx)
    {
        free(fabric->ctx);
    }

#ifdef SST_HAVE_CRAY_DRC
    if (fabric->auth_key)
    {
        free(fabric->auth_key);
    }
#endif /* SST_HAVE_CRAY_DRC */

    fprintf(stderr, "finalized fabric.\n");
}

int main(int argc, char **argv)
{
    struct fabric_state fabric = {0}; /* zero all handles so none are left indeterminate */
    const char *ifname;

    ifname = NULL;
    if(argc > 1) {
        ifname = argv[1];
    }

    init_fabric(&fabric, ifname);
    fini_fabric(&fabric);

    return(0);
}
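The provider selection loop in init_fabric() is the main place where SST is stricter than fi_pingpong. The sketch below isolates just the two name tests from that loop: an exact strcmp() match, then a substring strstr() match. The enum and the function name classify_provider are hypothetical, and the sketch deliberately leaves out the ifname override and the rule that a substring match is only accepted while no candidate has been chosen yet.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical labels for the outcomes of the two name tests. */
enum prov_match { PROV_IGNORED = 0, PROV_LAYERED = 1, PROV_EXACT = 2 };

/*
 * Mirrors the name tests in init_fabric(): "verbs", "gni" and "psm2" are
 * accepted either exactly or as a substring of a layered provider name.
 * As in the sample, verbs is only accepted when a source address is set.
 */
static enum prov_match classify_provider(const char *prov_name, int has_src_addr)
{
    static const char *const supported[] = {"verbs", "gni", "psm2"};
    const size_t count = sizeof(supported) / sizeof(supported[0]);
    size_t i;

    /* First test: exact provider name. */
    for (i = 0; i < count; i++)
    {
        int usable = (i != 0) || has_src_addr; /* index 0 is "verbs" */
        if (usable && strcmp(prov_name, supported[i]) == 0)
            return PROV_EXACT;
    }
    /* Second test: supported name appears inside a layered provider name. */
    for (i = 0; i < count; i++)
    {
        int usable = (i != 0) || has_src_addr;
        if (usable && strstr(prov_name, supported[i]) != NULL)
            return PROV_LAYERED;
    }
    return PROV_IGNORED;
}
```

With this, "verbs;ofi_rxm" is only a substring match, which fits the reader log above: the first such entry is taken as a candidate and later ones are reported as ignored. Names such as "tcp;ofi_rxm", "sockets" or "psm3" match neither test and would need FABRIC_IFACE to be selected.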

franzpoeschel commented 3 years ago

Thank you for the example code! This is very useful for testing configurations and the capabilities of a cluster. I will forward this to our admins and hope that this helps in resolving the issues.

Unfortunately, neither updating libfabric to the most recent version 1.12.1 (from 1.11.1) nor setting the environment variable FI_MR_CACHE_MAX_COUNT=0 resolved the error, but your code can reproduce it. FWIW, the error is produced by the call result = fi_domain(fabric->fabric, info, &fabric->domain, fabric->ctx);

philip-davis commented 3 years ago

With the updated libfabric, does the error message from ADIOS2/the sample code change? The error code that fi_domain fails with on 1.7.0 (-61) is not one of the expected return values, which is interesting. I'm curious whether the later versions fail differently.

franzpoeschel commented 3 years ago

No, the error code stays the same across versions. With the invocation of fi_pingpong (fi_pingpong -p verbs -e rdm & sleep 1; fi_pingpong localhost -p verbs -e rdm), I can even reproduce it across systems (I have confirmed this on my local machine as well as on Summit).

franzpoeschel commented 3 years ago

I dug a bit further into this. The error code stems from statements such as return -FI_ENODATA (libfabric returns its error codes negated). In particular, the -61 that I am seeing is generated at this line, suggesting that libfabric was unable to find a suitable provider. The backtrace for that is:

#0  fi_getinfo_ (version=65541, node=0x0, service=0x0, flags=576460752303423488, hints=0x694280, info=0x7fffffff72d0)
#1  0x00002aaaaad17309 in ofi_get_core_info (version=65541, node=0x0, service=0x0, flags=0, util_prov=0x2aaaab17d3e0 <rxm_util_prov>, util_hints=0x68f000, base_attr=0x0, info_to_core=0x2aaaaad8d847 <rxm_info_to_core>, core_info=0x7fffffff72d0)
#2  0x00002aaaaad90e41 in rxm_domain_open (fabric=0x6929b0, info=0x68f000, domain=0x7fffffff7430, context=0x0)
:528
#3  0x0000000000400b20 in fi_domain (fabric=0x6929b0, info=0x68f000, domain=0x7fffffff7430, context=0x0)
clude/rdma/fi_domain.h:308
#4  0x0000000000401315 in init_fabric (fabric=0x7fffffff7400, ifname=0x0)
:220
#5  0x0000000000401825 in main (argc=1, argv=0x7fffffff7548)
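Two of the raw numbers in this backtrace and in the error message can be decoded with plain arithmetic. The sketch below does not link against libfabric; the function names are hypothetical, and the version split simply repeats what the FI_MAJOR and FI_MINOR macros do. The error helper assumes a Linux system, where ENODATA has the value 61.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Same arithmetic as libfabric's FI_MAJOR / FI_MINOR macros. */
static unsigned decode_major(uint32_t version) { return version >> 16; }
static unsigned decode_minor(uint32_t version) { return version & 0xFFFFu; }

/*
 * libfabric calls return negative errno-style codes, so flip the sign
 * before looking up the message text.
 */
static const char *describe_return_code(int ret)
{
    return strerror(ret < 0 ? -ret : ret);
}
```

version=65541 decodes to major 1 and minor 5, that is, the FI_VERSION(1, 5) requested by the sample. Passing the raw negative value to a strerror-style lookup is a plausible reason why the sample prints "Unknown error -61" while fi_pingpong prints "No data available" for the same code; flipping the sign first, as describe_return_code() does, avoids that.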

So, the failure is produced while opening the domain. The verbose log from setting FI_LOG_LEVEL=Info:

libfabric:28148:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #1 mlx4_0
libfabric:28148:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: mlx4_0
libfabric:28148:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #2 mlx4_0
libfabric:28148:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: mlx4_0
libfabric:28148:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #3 mlx4_0-xrc
libfabric:28148:verbs:fabric:vrb_get_matching_info():1531<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:28148:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #4 mlx4_0-xrc
libfabric:28148:verbs:fabric:vrb_get_matching_info():1531<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:28148:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #5 mlx4_0-dgram
libfabric:28148:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: mlx4_0-dgram
libfabric:28148:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #6 mlx4_0-dgram
libfabric:28148:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: mlx4_0-dgram
libfabric:28148:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_rxd
libfabric:28148:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_mrail
libfabric:28148:core:core:fi_fabric_():1220<info> Opened fabric: IB-0xfe80000000000000
libfabric:28148:core:core:fi_fabric_():1220<info> Opened fabric: IB-0xfe80000000000000
libfabric:28148:ofi_rxm:core:fi_param_get_():280<info> variable use_srx=<not set>
libfabric:28148:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #1 mlx4_0
libfabric:28148:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: mlx4_0
libfabric:28148:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #2 mlx4_0
libfabric:28148:verbs:fabric:vrb_get_matching_info():1554<info> adding fi_info for domain: mlx4_0
libfabric:28148:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #3 mlx4_0-xrc
libfabric:28148:verbs:core:vrb_check_hints():264<info> skipping device mlx4_0-xrc (want mlx4_0)
libfabric:28148:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #4 mlx4_0-xrc
libfabric:28148:verbs:core:vrb_check_hints():264<info> skipping device mlx4_0-xrc (want mlx4_0)
libfabric:28148:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #5 mlx4_0-dgram
libfabric:28148:verbs:core:ofi_check_ep_type():658<info> unsupported endpoint type
libfabric:28148:verbs:core:ofi_check_ep_type():659<info> Supported: FI_EP_DGRAM
libfabric:28148:verbs:core:ofi_check_ep_type():659<info> Requested: FI_EP_MSG
libfabric:28148:verbs:fabric:vrb_get_matching_info():1509<info> checking domain: #6 mlx4_0-dgram
libfabric:28148:verbs:core:ofi_check_ep_type():658<info> unsupported endpoint type
libfabric:28148:verbs:core:ofi_check_ep_type():659<info> Supported: FI_EP_DGRAM
libfabric:28148:verbs:core:ofi_check_ep_type():659<info> Requested: FI_EP_MSG
libfabric:28148:verbs:fabric:vrb_get_rai_id():297<info> rdma_resolve_addr: Invalid argument(22)
libfabric:28148:verbs:fabric:vrb_get_rai_id():299<info> src addr: fi_sockaddr_ib://[fe80::2:c903:1b:be91]:0xffff:0x13f:0x0
libfabric:28148:verbs:fabric:vrb_get_rai_id():301<info> dst addr: (null)
libfabric:28148:verbs:fabric:vrb_get_match_infos():1778<info> handling of the socket address fails - -22
libfabric:28148:verbs:core:vrb_get_match_infos():1798<info> Handling of the addresses fails, the getting infos is unsuccessful
libfabric:28148:core:core:fi_getinfo_():1021<warn> fi_getinfo: provider verbs returned -61 (No data available)
libfabric:28148:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_rxd
libfabric:28148:core:core:ofi_layering_ok():915<info> Need core provider, skipping ofi_mrail
ERROR: accessing domain failed with -61 (Unknown error -61). This is fatal.
INFO: finalizing fabric..
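In the log above, rdma_resolve_addr fails with "Invalid argument" for a source address in the fe80::/10 range, which ties in with the earlier concern about IPv6 link-local addresses. Whether that is the actual cause is not established here. As a small aid for checking other nodes, the following sketch classifies an address string using only standard socket headers; the function name is_ipv6_link_local is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Returns 1 if text is an IPv6 link-local address (fe80::/10),
 * 0 if it is some other IPv6 address, and -1 if it cannot be parsed.
 */
static int is_ipv6_link_local(const char *text)
{
    struct in6_addr addr;

    if (inet_pton(AF_INET6, text, &addr) != 1)
        return -1;
    return IN6_IS_ADDR_LINKLOCAL(&addr) ? 1 : 0;
}
```

The input is the bracketed part of the src addr printed by libfabric, for example fe80::2:c903:1b:be91 from the log above.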

I noticed further that (a) ADIOS2 explicitly supports the psm2 provider which according to the documentation is the "High-speed InfiniBand networking from Intel". Our local installation of libfabric finds a psm3 provider, which I tried activating. Your code sample above can connect with that fine and finishes cleanly, but trying to use it in ADIOS2 leads to a failure on the reading side:

[…]
Writer is doing BP-based marshalling
Writer is using Minimum Connection Communication pattern (min)
Received contact info for WS_stream 0x42ecad0, WSR Rank 0
Sending Reader Activate messages to writer
Finish opening Stream "openPMD/simData", starting with Step number 0
Wait for next metadata after last timestep -1
Waiting for metadata for a Timestep later than TS -1
(PID 7e3e, TID 2aaaaaae5900) Stream status is Established
Received a Timestep metadata message for timestep 0, signaling condition
Examining metadata for Timestep 0
Returning metadata for Timestep 0
Setting TSmsg to Rootentry value
RdmaTimestepArrived with Timestep = 0, PreloadMode = 0
SstAdvanceStep returning Success on timestep 0
Performing remote read of Writer Rank 0 at step 0
Block address is 0x2aad90423010, with a key of 2
Remote read target is Rank 0 (Offset = 603988229, Length = 6)
[kepler025:32318:0:32318] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
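A fault address as small as 0x8 is what one would typically see when a struct member is read through a null pointer: the faulting address is then just the member's offset. The sketch below illustrates this with a made-up struct; example_handle is hypothetical and is not an ADIOS2 or libfabric type.

```c
#include <stddef.h>

/* Hypothetical layout, for illustration only. */
struct example_handle
{
    void *first;  /* offset 0 */
    void *second; /* offset sizeof(void *), i.e. 8 on 64-bit Linux */
};

/*
 * Address that would be touched when reading ->second through a
 * null struct example_handle pointer (computed without dereferencing).
 */
static size_t fault_address_for_null_base(void)
{
    return offsetof(struct example_handle, second);
}
```

This only narrows down the kind of bug; it does not say which pointer in the read path is null.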

The address 0x8 looks to me like a null pointer is involved somewhere. The fabric parameters that ADIOS2 configured look like this:

Fabric parameters to use at fabric initialization: fi_info:
    caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
    mode: [ FI_CONTEXT ]
    addr_format: FI_ADDR_PSMX3
    src_addrlen: 16
    dest_addrlen: 16
    src_addr: fi_addr_psmx3://ffff:102
    dest_addr: (null)
    handle: (nil)
    fi_tx_attr:
        caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_SEND, FI_TRIGGER ]
        mode: [ FI_CONTEXT ]
        op_flags: [  ]
        msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_WAR, FI_ORDER_WAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAR, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, FI_ORDER_ATOMIC_WAW ]
        comp_order: [ FI_ORDER_NONE ]
        inject_size: 64
        size: 18446744073709551615
        iov_limit: 13
        rma_iov_limit: 1
    fi_rx_attr:
        caps: [ FI_MSG, FI_RMA, FI_RECV, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_RMA_EVENT, FI_SOURCE ]
        mode: [ FI_CONTEXT ]
        op_flags: [  ]
        msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_WAR, FI_ORDER_WAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAR, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, FI_ORDER_ATOMIC_WAW ]
        comp_order: [ FI_ORDER_NONE ]
        total_buffered_recv: 0
        size: 18446744073709551615
        iov_limit: 1
    fi_ep_attr:
        type: FI_EP_RDM
        protocol: FI_PROTO_PSMX3
        protocol_version: 768
        max_msg_size: 4294963200
        msg_prefix_size: 0
        max_order_raw_size: 4096
        max_order_war_size: 4096
        max_order_waw_size: 4096
        mem_tag_format: 0x0aaaaaaaaaaaaaaa
        tx_ctx_cnt: 1
        rx_ctx_cnt: 1
        auth_key_size: 16
    fi_domain_attr:
        domain: 0x0
        name: mlx4_1
        threading: FI_THREAD_SAFE
        control_progress: FI_PROGRESS_AUTO
        data_progress: FI_PROGRESS_AUTO
        resource_mgmt: FI_RM_ENABLED
        av_type: FI_AV_UNSPEC
        mr_mode: [ FI_MR_BASIC ]
        mr_key_size: 8
        cq_data_size: 4
        cq_cnt: 65535
        ep_cnt: 65535
        tx_ctx_cnt: 1
        rx_ctx_cnt: 1
        max_ep_tx_ctx: 1
        max_ep_rx_ctx: 1
        max_ep_stx_ctx: 1
        max_ep_srx_ctx: 0
        cntr_cnt: 65535
        mr_iov_limit: 65535
    caps: [ FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SHARED_AV ]
    mode: [  ]
        auth_key_size: 16
        max_err_data: 64
        mr_cnt: 65535
    fi_fabric_attr:
        name: psm3
        prov_name: psm3
        prov_version: 112.10
        api_version: 1.5
    fid_nic:
        fi_device_attr:
            name: (null)
            device_id: (null)
            device_version: (null)
            vendor_id: (null)
            driver: (null)
            firmware: (null)
        fi_bus_attr:
            fi_bus_type: FI_BUS_PCI
            fi_pci_attr:
                domain_id: 0
                bus_id: 129
                device_id: 0
                function_id: 0
        fi_link_attr:
            address: (null)
            mtu: 0
            speed: 0
            state: FI_LINK_UNKNOWN
            network_type: (null)

I assume that network_type must be non-null for this to work. Or does this rather have to do with the missing psm3 support in ADIOS2/SST?
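To separate what libfabric itself reports from what ADIOS2 makes of it, the provider can be queried directly with the fi_info utility that ships with libfabric. This is only a sketch: the function name probe_provider is made up for illustration, psm3 is just the provider name from the log above, and the fallback message assumes fi_info exits non-zero when it finds no matching entry.

```shell
# List what libfabric itself reports for one provider, independent of ADIOS2.
# probe_provider is a hypothetical helper; the default provider name is only
# an example taken from this thread.
probe_provider() {
    provider="${1:-psm3}"
    if ! command -v fi_info >/dev/null 2>&1; then
        echo "fi_info not found in PATH"
        return 0
    fi
    # Assumption: fi_info returns a non-zero status when no matching
    # provider entry is available.
    fi_info -p "$provider" || echo "no usable '$provider' entries reported"
}

probe_provider psm3
```

If this already reports nothing useful for the provider, the problem is more likely in the libfabric installation than in ADIOS2/SST.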

franzpoeschel commented 3 years ago

Is it possible that this is not an issue of our local cluster, but rather an issue of either libfabric or its configuration in ADIOS2? I've been able to reproduce the same crashing behavior on Summit now. Here is a minimal reproducible example for Summit: libfabric-minimal-example-for-adios2-summit.zip

By running bsub job.sh from the extracted archive (after supplying your project ID), I get the following output in build/log:

INFO: initialzing fabric...
INFO: seeing candidate fabric verbs;ofi_rxm, will use this unless we see something better.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric verbs;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric tcp;ofi_rxm because it's not of a supported type.
ignoring fabric sockets because it's not of a supported type.
ignoring fabric sockets because it's not of a supported type.
ignoring fabric sockets because it's not of a supported type.
ignoring fabric sockets because it's not of a supported type.
ignoring fabric sockets because it's not of a supported type.
ignoring fabric sockets because it's not of a supported type.
ignoring fabric sockets because it's not of a supported type.
ignoring fabric sockets because it's not of a supported type.
INFO: fabric parameters to use at fabric initialization: fi_info:
    caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM ]
    mode: [ FI_LOCAL_MR ]
    addr_format: FI_SOCKADDR_IB
    src_addrlen: 48
    dest_addrlen: 0
    src_addr: fi_sockaddr_ib://[fe80::ec0d:9a03:7f:55b8]:0xffff:0x13f:0x0
    dest_addr: (null)
    handle: (nil)
    fi_tx_attr:
        caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_SEND ]
        mode: [ FI_LOCAL_MR ]
        op_flags: [  ]
        msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAW ]
        comp_order: [ FI_ORDER_NONE ]
        inject_size: 16384
        size: 65536
        iov_limit: 4
        rma_iov_limit: 1
    fi_rx_attr:
        caps: [ FI_MSG, FI_RMA, FI_RECV, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV ]
        mode: [  ]
        op_flags: [  ]
        msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAW ]
        comp_order: [ FI_ORDER_NONE ]
        total_buffered_recv: 0
        size: 65536
        iov_limit: 4
    fi_ep_attr:
        type: FI_EP_RDM
        protocol: FI_PROTO_RXM
        protocol_version: 1
        max_msg_size: 1073741824
        msg_prefix_size: 0
        max_order_raw_size: 1073741824
        max_order_war_size: 0
        max_order_waw_size: 1073741824
        mem_tag_format: 0xaaaaaaaaaaaaaaaa
        tx_ctx_cnt: 1
        rx_ctx_cnt: 1
        auth_key_size: 0
    fi_domain_attr:
        domain: 0x0
        name: mlx5_2
        threading: FI_THREAD_SAFE
        control_progress: FI_PROGRESS_AUTO
        data_progress: FI_PROGRESS_AUTO
        resource_mgmt: FI_RM_ENABLED
        av_type: FI_AV_UNSPEC
        mr_mode: [ FI_MR_BASIC ]
        mr_key_size: 4
        cq_data_size: 4
        cq_cnt: 65536
        ep_cnt: 32768
        tx_ctx_cnt: 1
        rx_ctx_cnt: 1
        max_ep_tx_ctx: 1
        max_ep_rx_ctx: 1
        max_ep_stx_ctx: 0
        max_ep_srx_ctx: 0
        cntr_cnt: 0
        mr_iov_limit: 1
    caps: [ FI_LOCAL_COMM, FI_REMOTE_COMM ]
    mode: [  ]
        auth_key_size: 0
        max_err_data: 0
        mr_cnt: 0
    fi_fabric_attr:
        name: IB-0xfe80000000000000
        prov_name: verbs;ofi_rxm
        prov_version: 112.10
        api_version: 1.5
    fid_nic:
        fi_device_attr:
            name: mlx5_2
            device_id: 0x1019
            device_version: 0
            vendor_id: 0x02c9
            driver: (null)
            firmware: 16.26.4400
        fi_bus_attr:
            fi_bus_type: FI_BUS_UNKNOWN
        fi_link_attr:
            address: (null)
            mtu: 4096
            speed: 100000000000
            state: FI_LINK_UP
            network_type: InfiniBand

ERROR: accessing domain failed with -61 (No data available). This is fatal.
INFO: finalizing fabric...

Note that I build libfabric myself in the supplied script. Using the libfabric/1.7.0-sysrdma provided on Summit instead results in the output:

INFO: initialzing fabric...
ERROR: no fabrics detected.
INFO: finalizing fabric...

(Also, it is necessary to run this within a batch job and not from a login node; otherwise no RDMA fabric will be found. Hence bsub job.sh instead of just ./job.sh.)

Does there exist a working RDMA setup for ADIOS2 on Summit?

eisenhauer commented 3 years ago

For Summit, I believe that Junmin confirmed last month that libfabric 1.9.0 worked for her, but the prebuilt module and various other versions had problems there. Unfortunately, this is fragile stuff...

guj commented 3 years ago

libfabric 1.9: my copy is here: /gpfs/alpine/world-shared/csc303/junmin/libfabric-1.9.0.tar

guj commented 3 years ago

my env (modify accordingly):

export LF_HOME=/ccs/home/junmin/work/libfabric/install_v1.9
export PATH=${PATH}:${LF_HOME}/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${LF_HOME}/lib

export LIBRARY_PATH=${LIBRARY_PATH}:${LF_HOME}/lib
export CPATH=${CPATH}:${LF_HOME}/include
export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:${LF_HOME}/lib/pkgconfig
export CMAKE_PREFIX_PATH=${CMAKE_PREFIX_PATH}:${LF_HOME}
export OLCF_LIBFABRIC_ROOT=libfabric-1.9.0
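Since several libfabric installations can coexist on a system (a prebuilt module next to a self-built copy), it may help to check which one an environment like the above actually resolves to. A minimal sketch; show_libfabric is a hypothetical helper name, and it assumes the installation ships a libfabric.pc file for pkg-config, as the PKG_CONFIG_PATH entry above suggests:

```shell
# Report which libfabric installation the current environment resolves to.
# Prints "not found" instead of failing when a tool or package is missing.
show_libfabric() {
    fi_info_path="$(command -v fi_info || true)"
    echo "fi_info: ${fi_info_path:-not found}"

    version=""
    if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists libfabric; then
        version="$(pkg-config --modversion libfabric)"
    fi
    echo "libfabric (pkg-config): ${version:-not found}"
}

show_libfabric
```

If the printed path or version is not the one under LF_HOME, another module or an earlier PATH entry is probably taking precedence.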

franzpoeschel commented 3 years ago

I modified my test script now to (1) use libfabric 1.9.1 and (2) not load the rdma-core module.

#!/usr/bin/env bash
#BSUB -q batch
#BSUB -W 1:00
#BSUB -P "$PROJECT_ID" 
#BSUB -ln_slots 1
#BSUB -o stdout.%J
#BSUB -e stderr.%J

module purge
module load gcc/8.1.1 spectrum-mpi/10.3.1.2-20200121 cmake/3.18.2
mkdir -p build
cd build
export LD_LIBRARY_PATH="$(pwd)/local/lib:$LD_LIBRARY_PATH"
export PKG_CONFIG_PATH="$(pwd)/local/lib/pkgconfig:$PKG_CONFIG_PATH"

# quickly and dirtily build libfabric
wget -nc https://github.com/ofiwg/libfabric/releases/download/v1.9.1/libfabric-1.9.1.tar.bz2
tar -xjf libfabric-1.9.1.tar.bz2
cd libfabric-1.9.1
./configure --prefix="$(pwd)/../local"
make -j 2 install
cd ..

cmake ..
make -j 2
./main > log 2>&1

This one passed the quick check without error, so it seems that this might be a valid solution. Will report back once I've tried this with ADIOS2.

franzpoeschel commented 3 years ago

Alright, I have now been able to run an RDMA-based loosely coupled application on Summit with this, and the weird errors on our local cluster don't occur with that version either. So I guess the purpose of this issue (having a small libfabric-based reproducer for ADIOS2 features) is fulfilled. It also looks like newer versions of libfabric are not compatible with ADIOS2.
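Given how version-sensitive this turned out to be, a job script could warn early when it picks up a libfabric outside the series that worked here. The sketch below is based only on the reports in this thread (1.9.0 and 1.9.1 worked) and is not an official ADIOS2 requirement; is_libfabric_1_9 is a hypothetical helper name.

```shell
# Succeeds if the given version string belongs to the 1.9.x series.
# Assumption: 1.9.x is "known good" only according to this thread.
is_libfabric_1_9() {
    case "$1" in
        1.9|1.9.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Query the version pkg-config resolves; empty if it cannot be determined.
found="$(pkg-config --modversion libfabric 2>/dev/null || true)"
if is_libfabric_1_9 "$found"; then
    echo "libfabric $found: in the 1.9.x series"
else
    echo "libfabric ${found:-<not found>}: outside the 1.9.x series, RDMA in SST may misbehave"
fi
```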

But I noticed on Summit that the RDMA-based transport only works if the stream stays on a single node. Once I try to go multi-node, I get error 113 (Host is down) on the reading end, while the writing end runs to the end cleanly. Do you have a quick idea what could be causing this and how I could fix it? (Should I close this issue and open a new one?)

EDIT: I now see that that was actually a secondary error. On the writer's side, I forgot to set a "QueueLimit", so the writer just continued until it ran out of memory. The actual issue is that an engine.BeginStep() call on the reader's side never returns. It works fine on two nodes, but on 10 nodes it just hangs.