ornladios / ADIOS2

Next generation of ADIOS developed in the Exascale Computing Program
https://adios2.readthedocs.io/en/latest/index.html
Apache License 2.0

Libfabric issues #2887

Open pnorbert opened 3 years ago

pnorbert commented 3 years ago

This ticket is to track who is doing what to fix the build/installation/usage problems of libfabric.

Existing problems:

pnorbert commented 3 years ago

@ax3l @franzpoeschel @suchyta1

halehawk commented 2 years ago

I am trying to use libfabric 1.12 with ADIOS2 now, and I get the error "psm2_ep_connect returned error Operation timed out" when running across different nodes. Does this mean I should not use this version of libfabric at all?

eisenhauer commented 2 years ago

There are a couple of important run-time environment variables to use, at least on Summit. Given those, the version of libfabric is less important. Try running with:

export FABRIC_IFACE=mlx5_0
export FI_OFI_RXM_USE_SRX=1
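As a job-script sketch, those settings would be applied before launching either side of the stream (the launcher lines and binary names below are placeholders, not from this thread):

```shell
# Apply the suggested libfabric/SST settings before launching.
export FABRIC_IFACE=mlx5_0      # verbs interface SST should use
export FI_OFI_RXM_USE_SRX=1     # rxm: use the MSG provider's shared receive context
echo "FABRIC_IFACE=$FABRIC_IFACE FI_OFI_RXM_USE_SRX=$FI_OFI_RXM_USE_SRX"
# e.g. on Summit (illustrative):
# jsrun -n 6 ./sst_writer &
# jsrun -n 1 ./sst_reader
```

Because SST reads these from the environment at engine creation, they must be exported before the ADIOS2 constructor runs in each process.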

philip-davis commented 2 years ago

Is this a machine that has a PSM fabric? It might have two PSM interfaces and libfabric is using the wrong one. Would it be possible to get the output of ‘fi_info -e FI_EP_RDM’ run on a compute node?



halehawk commented 2 years ago

From fi_info, I only see psm3 provider.

Here are PSM3 env I got from fi_info -e FI_EP_RDM:

# FI_PSM3_NAME_SERVER: Boolean (0/1, on/off, true/false, yes/no)
# psm3: Whether to turn on the name server or not (default: yes)

# FI_PSM3_TAGGED_RMA: Boolean (0/1, on/off, true/false, yes/no)
# psm3: Whether to use tagged messages for large size RMA or not (default: yes)

# FI_PSM3_UUID: String
# psm3: Unique Job ID required by the fabric

# FI_PSM3_DELAY: Integer
# psm3: Delay (seconds) before finalization (for debugging)

# FI_PSM3_TIMEOUT: Integer
# psm3: Timeout (seconds) for gracefully closing the PSM3 endpoint

# FI_PSM3_CONN_TIMEOUT: Integer
# psm3: Timeout (seconds) for establishing connection between two PSM3 endpoints

# FI_PSM3_PROG_INTERVAL: Integer
# psm3: Interval (microseconds) between progress calls made in the progress thread (default: 1 if affinity is set, 1000 if not)

# FI_PSM3_PROG_AFFINITY: String
# psm3: When set, specify the set of CPU cores to set the progress thread affinity to. The format is <start>[:<end>[:<stride>]][,<start>[:<end>[:<stride>]]]*, where each triplet <start>:<end>:<stride> defines a block of core_ids. Both <start> and <end> can be either the core_id (when >=0) or core_id - num_cores (when <0). (default: affinity not set)

# FI_PSM3_INJECT_SIZE: Integer
# psm3: Maximum message size for fi_inject and fi_tinject (default: 64)

# FI_PSM3_LOCK_LEVEL: Integer
# psm3: How internal locking is used. 0 means no locking (default: 2)

# FI_PSM3_LAZY_CONN: Boolean (0/1, on/off, true/false, yes/no)
# psm3: Whether to force lazy connection mode (default: no)

# FI_PSM3_DISCONNECT: Boolean (0/1, on/off, true/false, yes/no)
# psm3: Whether to issue disconnect request when process ends (default: no)

# FI_PSM3_TAG_LAYOUT: String
# psm3: How the 96 bit PSM3 tag is organized: tag60 means 32/4/60 for data/flags/tag; tag64 means 4/28/64 for flags/data/tag (default: tag60)

philip-davis commented 2 years ago

Apologies, I gave you the wrong flag. It should be ‘fi_info -t FI_EP_RDM’ (note the -t instead of -e)



philip-davis commented 2 years ago

Yes, try the environment variables @eisenhauer sent. If those don't work, try setting FABRIC_IFACE to mlx5_1

(In reply to haiying xu's attachment: temp_rdm.txt)

halehawk commented 2 years ago

I tried, still same error "libfabric:236045:psm3:av:psmx3_epid_to_epaddr():231 psm2_ep_connect returned error Operation timed out, remote epid=61007d8903."

philip-davis commented 2 years ago

Could you try exporting FI_PSM2_DISCONNECT=1 after the writer is started but before the reader starts?

(In reply to haiying xu's attachments: temp_server.txt, temp_client.txt)

halehawk commented 2 years ago

This setting has no impact.

halehawk commented 2 years ago

I checked the installation log of my libfabric; it looks like psm2 is not installed, but psm3 is. How can I disable psm3 as well and force it to use verbs? I tried to disable psm3 while installing libfabric, but that didn't work because psm3.h is available on the path.

philip-davis commented 2 years ago

Are you installing with Spack, or manually?

On Fri, Oct 8, 2021 at 9:06 PM haiying xu @.***> wrote:

I checked the installation log of my libfabric, it looks like psm2 is not installed, but psm3 is installed. How can I disable psm3 as well, enforce to use verbs? I tried to disable psm3 while installing libfabric, but didn't work because psm3.h is available on the path.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ornladios/ADIOS2/issues/2887#issuecomment-939196713, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABRSFI4IKQCGRXW3UX37VGTUF6ISTANCNFSM5E56Z2BQ .

halehawk commented 2 years ago

I installed manually like this:

./configure --enable-cuda-dlopen --enable-gdrcopy-dlopen --enable-sockets=yes --enable-verbs=yes --enable-tcp=yes --enable-rxm=yes --enable-shm=yes --disable-psm --with-cuda=/glade/u/apps/dav/opt/cuda/11.0.3/

philip-davis commented 2 years ago

In 1.12.0, you can individually disable the psm providers with --disable-psm --disable-psm2 --disable-psm3

Alternatively, you can set the environment variable FI_PROVIDER to verbs to limit the native providers libfabric will interrogate at runtime.
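Put together, a minimal build-and-run sketch along those lines (the install prefix is a placeholder, and the flag set is trimmed to the ones relevant here) could look like:

```shell
# Build libfabric 1.12.x without any PSM provider, then pin the runtime
# provider to verbs. Flags mirror the discussion above; prefix is illustrative.
./configure --prefix="$HOME/sw/libfabric" \
            --enable-verbs=yes --enable-rxm=yes --enable-tcp=yes \
            --disable-psm --disable-psm2 --disable-psm3
make -j && make install

# At run time, restrict provider selection even if other providers were built:
export FI_PROVIDER=verbs
```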


halehawk commented 2 years ago

After I added --disable-psm --disable-psm2 --disable-psm3 and recompiled libfabric, I ran with export FI_OFI_RXM_USE_SRX=1, export FI_PROVIDER=verbs, and export FI_PSM2_DISCONNECT=1. My SST engine program now works on two nodes. Great, thank you all very much!

halehawk commented 2 years ago

So I always have to know the server FABRIC_IFACE and the client FABRIC_IFACE beforehand if they vary across nodes, right?

philip-davis commented 2 years ago

Yes, in the absence of some value in FABRIC_IFACE, SST is going to pick the first fabric interface that meets its constraints; it doesn't have the ability to do a connectivity check on the client side.


halehawk commented 2 years ago

Thanks. How can I run the server on multiple nodes? If I cannot, how can I get the best performance using SST and bpfile on one node?

philip-davis commented 2 years ago

You should be able to run the writer and/or reader on any number of available nodes. The only wrinkle might be if different ranks in the same MPI group need to choose different fabric interfaces, in which case the correct value of FABRIC_IFACE should be set on a per-rank basis using setenv().
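One way to do per-rank selection without touching application code is a launcher wrapper that exports FABRIC_IFACE before exec'ing the binary. A sketch, assuming the rank variables common launchers set and a purely illustrative even/odd interface mapping:

```shell
#!/bin/sh
# iface_wrap.sh (hypothetical name): run as  mpirun -n N ./iface_wrap.sh ./app
# SLURM_PROCID / OMPI_COMM_WORLD_RANK are set by srun / Open MPI respectively;
# the even/odd interface mapping below is only an illustration.
rank=${SLURM_PROCID:-${OMPI_COMM_WORLD_RANK:-0}}
if [ $((rank % 2)) -eq 0 ]; then
    export FABRIC_IFACE=mlx5_0
else
    export FABRIC_IFACE=mlx5_1
fi
exec "$@"
```

Calling setenv("FABRIC_IFACE", ...) from inside the application before the ADIOS2 constructor achieves the same effect without a wrapper.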


halehawk commented 2 years ago

I ran ibv_devinfo to get the fabric interface; is there a smarter way to get this info and set FABRIC_IFACE? Thanks!

philip-davis commented 2 years ago

Is this Casper? If so, the way I have found to differentiate the interfaces is to look at the link_layer field. For example, in the following, mlx5_0 has a link_layer of 'Ethernet' and won't work for RDMA access, while mlx5_1 has a link_layer of 'InfiniBand' and will.


hca_id: mlx5_0
    transport:          InfiniBand (0)
    fw_ver:             20.29.2002
    node_guid:          0c42:a103:006b:2a60
    sys_image_guid:         0c42:a103:006b:2a60
    vendor_id:          0x02c9
    vendor_part_id:         4123
    hw_ver:             0x0
    board_id:           MT_0000000222
    phys_port_cnt:          1
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet

hca_id: mlx5_1
    transport:          InfiniBand (0)
    fw_ver:             20.29.2002
    node_guid:          0c42:a103:0098:3f14
    sys_image_guid:         0c42:a103:0098:3f14
    vendor_id:          0x02c9
    vendor_part_id:         4123
    hw_ver:             0x0
    board_id:           MT_0000000225
    phys_port_cnt:          1
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         19
            port_lid:       24
            port_lmc:       0x00
            link_layer:     InfiniBand

Unfortunately, the fabric selection done by SST is not fine-grained enough to be instructed to automatically choose a fabric based on this distinction. There are a couple different ways I can think of to deal with this programmatically. One is to do a large run of ibv_devinfo in order to assemble a mapping of nodes to interfaces on those nodes that have an Infiniband link layer and then read that mapping in at runtime to set the FABRIC_IFACE ahead of the ADIOS2 constructor.
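The first approach boils down to scraping ibv_devinfo output (ibv_devinfo ships with rdma-core) for the first device whose link_layer is InfiniBand. A small sketch; the helper name is made up:

```shell
# Print the first HCA whose link_layer is InfiniBand, e.g. to do
#   export FABRIC_IFACE=$(ibv_devinfo | pick_ib_iface)
# on each node, or to build a node-to-interface map ahead of time.
pick_ib_iface() {
    awk '/^hca_id:/ { hca = $2 }
         /link_layer:/ && /InfiniBand/ { print hca; exit }'
}
```

On a node like the Casper example above, this would select mlx5_1.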

A second way that is a little more elegant is to do what ibv_devinfo is doing to query the interfaces. Something like:


#include <stdio.h>
#include <stdlib.h>

#include <infiniband/verbs.h>

int main()
{
    struct ibv_device **dev_list, **orig_dev_list;
    struct ibv_context *ctx;
    struct ibv_device_attr_ex device_attr = {};
    struct ibv_port_attr port_attr;
    int port;
    const char *iface_name;
    int rc;

    dev_list = orig_dev_list = ibv_get_device_list(NULL);

    if (!dev_list) {
        perror("Failed to get IB devices list");
        return -1;
    }

    if (!*dev_list) {
        fprintf(stderr, "No IB devices found\n");
        goto out;
    }

    while (*dev_list) {
        ctx = ibv_open_device(*dev_list);
        if (!ctx) {
            fprintf(stderr, "Failed to open device\n");
            goto cleanup;
        }
        if (ibv_query_device_ex(ctx, NULL, &device_attr)) {
            fprintf(stderr, "Failed to query device props\n");
            goto cleanup;
        }
        for (port = 1; port <= device_attr.orig_attr.phys_port_cnt; ++port) {
            rc = ibv_query_port(ctx, port, &port_attr);
            if (rc) {
                fprintf(stderr, "Failed to query port %u props\n", port);
                goto cleanup;
            }
            if (port_attr.state == IBV_PORT_ACTIVE &&
                port_attr.link_layer == IBV_LINK_LAYER_INFINIBAND) {
                iface_name = ibv_get_device_name(*dev_list);
                fprintf(stderr, "Found potentially usable interface: %s\n",
                        iface_name);
                setenv("FABRIC_IFACE", iface_name, 0);
                break;
            }
        }
        ibv_close_device(ctx);
        ++dev_list;
    }

    ibv_free_device_list(orig_dev_list);

    return 0;

cleanup:
    if (ctx)
        ibv_close_device(ctx);
out:
    ibv_free_device_list(orig_dev_list);
    return 1;
}

This is more or less built out of snippets from the ibv_devinfo utility packaged with verbs: https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/examples/devinfo.c

I haven't tested it very extensively, but it basically just sets FABRIC_IFACE to the first interface with an InfiniBand link layer that is in an active state. It won't work for nodes that have more than one interface configured for InfiniBand operation, but it should work on Casper, where one link is configured for InfiniBand and the other for Ethernet.
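For completeness, the snippet needs to be linked against libibverbs (the file name here is illustrative):

```shell
cc -o pick_iface pick_iface.c -libverbs
./pick_iface    # prints the first active InfiniBand interface it finds
```

Note that setenv() only affects the calling process, so in practice the detection code would run inside the application itself, before the ADIOS2 constructor, rather than as a standalone binary.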


halehawk commented 2 years ago

Great, this code works like a charm on Casper. Thanks!

halehawk commented 2 years ago

Do you know which ADIOS2 example can give the best write performance? I tried the heat transfer example with an SST/RDMA writer and an SST/RDMA reader, with the reader outputting a bp file. The performance is not what I expected.

pnorbert commented 2 years ago

We don't specifically have examples to assess SST performance. What are you trying to assess? Heat Transfer is iterating way too fast for its small data. It is great for tutorials but not for real world application performance.

Gray-Scott would be a better example since it is 3D and computes a bit more before output. It is still an example used in tutorials though. You may attempt to scale the data size up but that increases computation too so the ratio is still quite low.

A publicly available application is LAMMPS, which can be built with the ADIOS2 user module and configured for a billion-atom simulation to move a lot of data. One could use adios2_reorganize to write to disk asynchronously. We also have a test set up for using the LAMMPS executable built with ADIOS2 (lmp), with RDF calculation on the fly, but I don't remember if we have results about this somewhere. It uses Savanna/Cheetah for testing, but the config files are there for LAMMPS to run.

Another option is to fabricate your own adios2_iotest to imitate the data size and frequency of output with a simple code (see $ADIOS2_DIR/share/iotest-config examples).

halehawk commented 2 years ago

@pnorbert, I checked all the examples you mentioned; it looks like they all use bpfile/bp4 as the writer and reader engine. Does that mean the bp4 writer engine gives the best write performance over all others such as sst/rdma, sst/wan, and hdf5? I am also interested to know whether the sst/rdma writer engine has an asynchronous mode. I tried Put/Get deferred mode, but didn't get much improvement. Thanks!

ax3l commented 2 years ago

Hi there,

ECP WarpX maintainer here. I am trying to document a work-around for all our users (same for @franzpoeschel for PIConGPU) by applying your current work-around above https://github.com/ornladios/ADIOS2/issues/2887#issuecomment-939098556 to all our user-facing job scripts.

I want to avoid negative interference with general MPI comms and other aspects, thus can we clarify these options please before I place them in all our Summit/Cori/Perlmutter/Spock job scripts?

export FABRIC_IFACE=mlx5_0

This limits the huge number of devices visible in libfabric 1.6+ down to a smaller/reasonable number of Mellanox cards per node?

Summit has a single NIC, but with dual EDR ports. Will both still be used after this?

FABRIC_IFACE seems to be an undocumented SST option: https://github.com/ornladios/ADIOS2/search?q=FABRIC_IFACE Can we please add documentation for it in the ADIOS manual? :-)

Am I right in the assumption that this is an ADIOS SST only option and has no influence on any other aspects of libfabric, especially, it does not influence anything in our MPI comms for our app? I quickly checked in libfabric, but just want to double-check.

export FI_OFI_RXM_USE_SRX=1

This is a libfabric option, documented as:

FI_OFI_RXM_USE_SRX Set this to 1 to use shared receive context from MSG provider. This reduces overall memory usage but there may be a slight increase in latency (default: 0).

Can you comment on why this is needed? Will this also influence the performance of our application-side MPI communications?

Thanks for your help and debugging these nifty changes in libfabric :)

cc @eisenhauer @philip-davis

franzpoeschel commented 2 years ago

In addition to export FABRIC_IFACE=mlx5_0 and export FI_OFI_RXM_USE_SRX=1, I see that two further variables have been suggested (FI_PROVIDER=verbs and FI_PSM2_DISCONNECT=1), and I'm not sure whether we should include them in our documentation as well:

eisenhauer commented 2 years ago

We use different providers on different hosts, so specifying FI_PROVIDER=verbs everywhere (or not checking for gni and psm2) would be restrictive.

WRT FI_PSM2_DISCONNECT, I think it's a good recommendation. From the docs: "FI_PSM2_DISCONNECT : The provider has a mechanism to automatically send disconnection notifications to all connected peers before the local endpoint is closed. As the response, the peers call psm2_ep_disconnect to clean up the connection state at their side. This allows the same PSM2 epid be used by different dynamically started processes (clients) to communicate with the same peer (server). This mechanism, however, introduce extra overhead to the finalization phase. For applications that never reuse epids within the same session such overhead is unnecessary."

franzpoeschel commented 2 years ago

Thank you for the clarification. Then, we'll leave FI_PROVIDER out. Are we safe with just specifying FI_PSM2_DISCONNECT in the beginning of a job script? Philip's earlier comment on that:

Could you try exporting FI_PSM2_DISCONNECT=1 after the writer is started but before the reader starts?

This reads as if it has to be set at exactly that point in time, but maybe I'm misunderstanding.

suchyta1 commented 2 years ago

Has anyone tested what settings should be used on Slingshot / Spock? I have a WDMApp run on Spock, which works if I set SST to use WAN, but which fails if I set it to use RDMA instead. I'm using the single libfabric module installed on Spock, which is libfabric/1.11.0.4.75.

I've attached the output from setting SstVerbose=1 and FI_LOG_LEVEL=Debug (in the stderr files). XGC is reading data from GENE, and gets a connection refusal. I realize these two files might not be all that you need to understand what's going on, so let me know what you'd like me to share to help debug.

codar.workflow.stderr.xgc.log codar.workflow.stdout.xgc.log codar.workflow.stderr.gene.log codar.workflow.stdout.gene.log

eisenhauer commented 2 years ago

Hi Eric. Can you do SstVerbose=5 instead? (Verbose is a range, 0-5, so 5 will give us a lot more info.) FYI, the "unexpected connection close event" output generally just represents the fact that the other party dropped an existing link without going through any shutdown protocol. Often, that means the opposite party died, for whatever reason. Here it looks like XGC got an unexpected connection close, so probably it was GENE that died. Looking at the stderr.gene.log file, it looks like on at least some ranks GENE died from a floating point exception at src/gene.F90:283. Now, you said it ran fine with WAN but not RDMA, so it may be that RDMA isn't delivering the right data and that's why it dies; it's not necessarily a problem with GENE itself, but looking at that line in gene.F90 might give us some clues (as well as the output with SstVerbose=5).

suchyta1 commented 2 years ago

Here are the logs with SstVerbose=5.

codar.workflow.stderr.gene.log codar.workflow.stdout.gene.log codar.workflow.stderr.xgc.log codar.workflow.stdout.xgc.log

eisenhauer commented 2 years ago

Hey Eric. Just wanted to let you know that I wasn't ignoring this, but instead am having trouble sorting out the next steps. I think the problem might be caused by xgc and gene seeing different values of sar_limit in libfabric (more specifically, gene seems to see it unset, while xgc sees it set to zero, an unacceptably low value). libfabric seems to use fi_param_get() to query this, and parameters are supposed to be defined by fi_param_define(). But this is a relatively new call in libfabric, and the ADIOS RDMA transport doesn't use it at all. So I would think they should both see it as undefined, but somehow they do not. I need to sort out how this might be happening and where to go from there. I'll let you know.

franzpoeschel commented 2 years ago

I am currently experiencing trouble with libfabric on Summit. I tested this same streaming workflow after the recent system upgrade and things were still working, but apparently no longer. Does anyone have any hints for setting up a streaming workflow on Summit? Has anyone successfully streamed on Summit since New Year?

I've used the following test matrix:

Versions tested:
(1) ADIOS2 2.7.1 release, libfabric 1.6.2
(2) ADIOS2 master (tag 53c551), libfabric/1.13.1-sysrdma module
(3) ADIOS2 master (tag 53c551), libfabric 1.12.2, compiled with --disable-psm --disable-psm2 --disable-psm3

Setup:
(a) One node, 6 GPUs streaming to one process
(b) Strongly scaled up to 10 nodes (same data size)
(c) Weakly scaled up to 10 nodes (10 times the original data size)

Environment variables:
(i) FI_PROVIDER=verbs, FABRIC_IFACE=mlx5_0, FI_OFI_RXM_USE_SRX=1
(iii) FI_PROVIDER=verbs, FABRIC_IFACE=mlx5_0, FI_OFI_RXM_USE_SRX=1, FI_PSM2_DISCONNECT=1

export OMPI_MCA_coll_ibm_skip_barrier=true is active in both variants.

Results:
(1,a,i) Setup works, 1.77 TB written
(1,a,iii) Setup works, 1.77 TB written
(1,b,i) Writer finishes computation, reader hangs on reading first step data (step opened successfully) [1]
(1,b,iii) No difference to (1,b,i)
(1,c,i) No difference to (1,b,i)
(1,c,iii) No difference to (1,b,i)
(2,a,i) Writer hangs as soon as the SST queue is full, reader hangs on reading first step data (step opened successfully) [2]
(2,a,iii) First step is streamed successfully, writer finishes successfully [3]
(2,b,i) Similar to (2,a,i), but different stderr: [4]
(2,b,iii) Similar to (2,a,i), but different stderr: [5]
(2,c,i) Behavior like (2,a,iii), reader stderr like [1]
(2,c,iii) Like (2,c,i)
(3,a,iii) First step is streamed successfully, writer continues with a large time penalty, stderr like [3]
(3,b,iii) Like (1,b,i)
(3,c,iii) Like (1,b,i), but writer continues with a large time penalty

Excerpts from stderr logs:

[1] End of reader's stderr with SST_VERBOSE:

Received contact info for WS_stream 0x347f2c20, WSR Rank 56
Received contact info for WS_stream 0x4dd2db10, WSR Rank 57
Received contact info for WS_stream 0x3a5bd430, WSR Rank 58
Received contact info for WS_stream 0x53baaa80, WSR Rank 59
Sending Reader Activate messages to writer
Finish opening Stream "openPMD/simData", starting with Step number 0
Wait for next metadata after last timestep -1
Waiting for metadata for a Timestep later than TS -1
(PID 384125, TID 20000004add0) Stream status is Established
Received a Timestep metadata message for timestep 0, signaling condition
Examining metadata for Timestep 0
Returning metadata for Timestep 0
Setting TSmsg to Rootentry value
RdmaTimestepArrived with Timestep = 0, PreloadMode = 0
SstAdvanceStep returning Success on timestep 0
ERROR:  One or more process (first noticed rank 8) terminated with signal 9 # job time limit

[2] End of reader's stderr with SST_VERBOSE:

        fi_link_attr:
            address: (null)
            mtu: 4096
            speed: 100000000000
            state: FI_LINK_UP
            network_type: InfiniBand

Waiting for writer response message in SstReadOpen("openPMD/simData")
finished wait writer response message in read_open
Opening Reader Stream.
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=1
Param -   QueueFullPolicy=Block
Param -   DataTransport=rdma
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable)
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader stream params are:
Param -   RegistrationMethod=File
Param -   DataTransport=rdma
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   AlwaysProvideLatestTimestep=False
Param -   OpenTimeoutSecs=6000 (seconds)
Param -   SpeculativePreloadMode=Off
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Writer is doing BP-based marshalling
Writer is using Minimum Connection Communication pattern (min)
Received contact info for WS_stream 0x36795700, WSR Rank 0
Received contact info for WS_stream 0x42f8fea0, WSR Rank 1
Received contact info for WS_stream 0x297fdc90, WSR Rank 2
Received contact info for WS_stream 0x3f854be0, WSR Rank 3
Received contact info for WS_stream 0x51a29790, WSR Rank 4
Received contact info for WS_stream 0x3c4c6c80, WSR Rank 5
Sending Reader Activate messages to writer
Finish opening Stream "openPMD/simData", starting with Step number 0
Wait for next metadata after last timestep -1
Waiting for metadata for a Timestep later than TS -1
(PID b93cf, TID 20000004af40) Stream status is Established
Received a Timestep metadata message for timestep 0, signaling condition
Examining metadata for Timestep 0
Returning metadata for Timestep 0
Setting TSmsg to Rootentry value
RdmaTimestepArrived with Timestep = 0, PreloadMode = 0
SstAdvanceStep returning Success on timestep 0
Performing remote read of Writer Rank 0 at step 0
Block address is 0x2007e88f0010, with a key of 25410
Remote read target is Rank 0 (Offset = 57508685, Length = 19169280)
ERROR:  One or more process (first noticed rank 0) terminated with signal 12 # What is the meaning of this signal on Summit?

[3] End of reader's stderr with SST_VERBOSE:

Rank 0, RdmaWaitForCompletion
Rank 0, RdmaWaitForCompletion
Rank 0, RdmaWaitForCompletion
Rank 0, RdmaWaitForCompletion
Rank 0, RdmaWaitForCompletion
Rank 0, RdmaWaitForCompletion
got completion for request with handle 0x79e8d890 (flags 260).
Rank 0, RdmaWaitForCompletion
got completion for request with handle 0x79e8d9d0 (flags 260).
Rank 0, RdmaWaitForCompletion
got completion for request with handle 0x79e8dee0 (flags 260).
Rank 0, RdmaWaitForCompletion
got completion for request with handle 0x79ea8270 (flags 260).
Rank 0, RdmaWaitForCompletion
got completion for request with handle 0x79ea8430 (flags 260).
Rank 0, RdmaWaitForCompletion
[…]
got completion for request with handle 0x79eaf5b0 (flags 260).
Rank 0, RdmaWaitForCompletion
got completion for request with handle 0x79eaf770 (flags 260).
Sending ReleaseTimestep message for timestep 1, one to each writer
ERROR:  One or more process (first noticed rank 0) terminated with signal 9

[4] End of reader's stderr with SST_VERBOSE:

Received contact info for WS_stream 0x4fd61d60, WSR Rank 54
Received contact info for WS_stream 0x2c947760, WSR Rank 55
Received contact info for WS_stream 0x491b1a90, WSR Rank 56
Received contact info for WS_stream 0x3f0b6bd0, WSR Rank 57
Received contact info for WS_stream 0x59f76b30, WSR Rank 58
Received contact info for WS_stream 0x49b57750, WSR Rank 59
Sending Reader Activate messages to writer
Finish opening Stream "openPMD/simData", starting with Step number 0
Wait for next metadata after last timestep -1
Waiting for metadata for a Timestep later than TS -1
(PID 1b0040, TID 20000004af40) Stream status is Established
Received a Timestep metadata message for timestep 0, signaling condition
Examining metadata for Timestep 0
Returning metadata for Timestep 0
Setting TSmsg to Rootentry value
RdmaTimestepArrived with Timestep = 0, PreloadMode = 0
SstAdvanceStep returning Success on timestep 0
Performing remote read of Writer Rank 0 at step 0
Block address is 0x20080c000010, with a key of 32096
Remote read target is Rank 0 (Offset = 7078733, Length = 2359296)
Performing remote read of Writer Rank 0 at step 0
Block address is 0x20080c000010, with a key of 32096
Remote read target is Rank 0 (Offset = 7078733, Length = 2359296)
Performing remote read of Writer Rank 0 at step 0
Block address is 0x20080c000010, with a key of 32096
Remote read target is Rank 0 (Offset = 7078733, Length = 2359296)
Performing remote read of Writer Rank 0 at step 0
Block address is 0x20080c000010, with a key of 32096
Remote read target is Rank 0 (Offset = 7078733, Length = 2359296)
Performing remote read of Writer Rank 0 at step 0
Block address is 0x20080c000010, with a key of 32096
Remote read target is Rank 0 (Offset = 7078733, Length = 2359296)
Performing remote read of Writer Rank 0 at step 0
Block address is 0x20080c000010, with a key of 32096
Remote read target is Rank 0 (Offset = 7078733, Length = 2359296)
Performing remote read of Writer Rank 0 at step 0
Block address is 0x20080c000010, with a key of 32096
Remote read target is Rank 0 (Offset = 7078733, Length = 2359296)
Performing remote read of Writer Rank 0 at step 0
Block address is 0x20080c000010, with a key of 32096
Remote read target is Rank 0 (Offset = 7078733, Length = 2359296)
Performing remote read of Writer Rank 0 at step 0
Block address is 0x20080c000010, with a key of 32096
Remote read target is Rank 0 (Offset = 7078733, Length = 2359296)
Performing remote read of Writer Rank 0 at step 0
Block address is 0x20080c000010, with a key of 32096
Remote read target is Rank 0 (Offset = 7078733, Length = 2359296)
ERROR:  One or more process (first noticed rank 0) terminated with signal 12

[5] End of reader's stderr with SST_VERBOSE:

Posted RDMA get for Writer Rank 1 for handle 0x1ad2c970
Performing remote read of Writer Rank 2 at step 0
Block address is 0x20080c000010, with a key of 30800
Remote read target is Rank 2 (Offset = 913052996, Length = 58982400)
Posted RDMA get for Writer Rank 2 for handle 0x1ad2cb30
Performing remote read of Writer Rank 3 at step 0
Block address is 0x20080c000010, with a key of 62421
Remote read target is Rank 3 (Offset = 913052996, Length = 58982400)
Posted RDMA get for Writer Rank 3 for handle 0x1ad2cee0
Performing remote read of Writer Rank 4 at step 0
Block address is 0x20080c000010, with a key of 127183
Remote read target is Rank 4 (Offset = 913052996, Length = 58982400)
Posted RDMA get for Writer Rank 4 for handle 0x1ad32ba0
Performing remote read of Writer Rank 5 at step 0
Block address is 0x20080c000010, with a key of 43658
Remote read target is Rank 5 (Offset = 913052996, Length = 58982400)
Posted RDMA get for Writer Rank 5 for handle 0x1ad32d60
Performing remote read of Writer Rank 6 at step 0
Block address is 0x20080c000010, with a key of 89400
Remote read target is Rank 6 (Offset = 913052996, Length = 58982400)
Posted RDMA get for Writer Rank 6 for handle 0x1ad32f20
Performing remote read of Writer Rank 7 at step 0
Block address is 0x20080c000010, with a key of 41344
Remote read target is Rank 7 (Offset = 913052996, Length = 58982400)
Posted RDMA get for Writer Rank 7 for handle 0x1ad330e0
Performing remote read of Writer Rank 8 at step 0
Block address is 0x20080c000010, with a key of 12816
Remote read target is Rank 8 (Offset = 913052996, Length = 58982400)
ERROR:  One or more process (first noticed rank 9) terminated with signal 12
suchyta1 commented 2 years ago

@franzpoeschel I don't know if I have anything to add, but @pnorbert asked me to recheck, and I've been having issues as well.

franzpoeschel commented 2 years ago

Thank you for checking, @suchyta1! I guess this suggests a systemic issue?

eisenhauer commented 2 years ago

Hi all. Following up on this is on my to-do list, but unfortunately I've been diverted to other tasks this month and haven't done much. Thank you, @franzpoeschel, for the extensive testing. Offhand, it doesn't look like the Spock and Summit issues are related. That said, I don't have a fix for either one yet. Have there been significant Summit updates more recent than the OS change last year?

eisenhauer commented 2 years ago

So, I just did a small-scale test (2 writers, 2 readers) on Summit using libfabric 1.13.2, and everything ran to completion without apparent issues. This was with:

    export FABRIC_IFACE=mlx5_0
    export OMPI_MCA_coll_ibm_skip_barrier=true
    export FI_MR_CACHE_MAX_COUNT=0
    export FI_OFI_RXM_USE_SRX=1
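In script form, the same settings might look like the following (a sketch; the jsrun lines and the `./writer`/`./reader` binary names are placeholders, not the exact commands used):

```shell
#!/bin/sh
# Environment discussed in this thread for SST RDMA on Summit (sketch).
export FABRIC_IFACE=mlx5_0                  # select the Mellanox HCA interface
export OMPI_MCA_coll_ibm_skip_barrier=true  # work around a Spectrum MPI barrier issue
export FI_MR_CACHE_MAX_COUNT=0              # disable libfabric's memory-registration cache
export FI_OFI_RXM_USE_SRX=1                 # use shared receive contexts in the RXM provider

# Placeholder launch commands; uncomment inside a real allocation:
# jsrun -n 2 ./writer &
# jsrun -n 2 ./reader

# Echo the settings so the job log records them.
echo "$FABRIC_IFACE $FI_MR_CACHE_MAX_COUNT $FI_OFI_RXM_USE_SRX"
```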

I'll be repeating this with more readers/writers to see where I might encounter issues.

suchyta1 commented 2 years ago

I may have forgotten an environment variable, so I'll need to recheck when I get a chance.

eisenhauer commented 2 years ago

A 10x10 test ran to completion on Summit too...

franzpoeschel commented 2 years ago

Thank you for having a look, @eisenhauer. Was this 10x10 test on a single node or across multiple nodes (i.e., 10x10 ranks or 10x10 nodes)? The two successful runs I did have were both single-node. I see that you also specify FI_MR_CACHE_MAX_COUNT=0, which I did not; I will try that and see if it changes anything.

suchyta1 commented 2 years ago

When I added the environment variable I had neglected, my Summit job indeed worked. I ran a WDMApp job with multiple processes on each node; XGC and GENE both need multiple nodes, but the two are never co-located on the same node, and both read and write. The codes haven't been rebuilt since the new year and were using libfabric/1.13.1-sysrdma.

franzpoeschel commented 2 years ago

Adding FI_MR_CACHE_MAX_COUNT=0 did not change the behavior for me, unfortunately.

I forgot to address your question "Have there been significant summit updates more recent than the OS change last year?", but I'm not aware of any major changes either.

Did either of you run your tests with SstVerbose=5 enabled, and do you still have the log available, @suchyta1 @eisenhauer? I'd like to check whether there is a configuration error on my end. I'm looking for this part in particular:

Fabric parameters to use at fabric initialization: fi_info:
    caps: [ FI_LOCAL_COMM, FI_REMOTE_COMM, FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV ]
    mode: [ FI_LOCAL_MR ]
    addr_format: FI_SOCKADDR_IN
    src_addrlen: 16
    dest_addrlen: 0
    src_addr: fi_sockaddr_in://10.41.3.242:0
    dest_addr: (null)
    handle: (null)
    fi_tx_attr:
        caps: [ FI_SOURCE, FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE ]
        mode: [ FI_LOCAL_MR ]
        op_flags: [  ]
        msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_STRICT ]
        comp_order: [  ]
        inject_size: 16320
        size: 1024
        iov_limit: 4
        rma_iov_limit: 1
    fi_rx_attr:
        caps: [ FI_SOURCE, FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV ]
        mode: [ FI_LOCAL_MR ]
        op_flags: [  ]
        msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_STRICT ]
        comp_order: [  ]
        total_buffered_recv: 0
        size: 1024
        iov_limit: 4
    fi_ep_attr:
        type: FI_EP_RDM
        protocol: FI_PROTO_RXM
        protocol_version: 1
        max_msg_size: 1073741824
        msg_prefix_size: 0
        max_order_raw_size: 1073741824
        max_order_war_size: 0
        max_order_waw_size: 1073741824
        mem_tag_format: 0xaaaaaaaaaaaaaaaa
        tx_ctx_cnt: 1
        rx_ctx_cnt: 1
        auth_key_size: 0
    fi_domain_attr:
        domain: 0x0
        name: mlx5_0
        threading: FI_THREAD_SAFE
        control_progress: FI_PROGRESS_AUTO
        data_progress: FI_PROGRESS_AUTO
        resource_mgmt: FI_RM_ENABLED
        av_type: FI_AV_UNSPEC
        mr_mode: [ FI_MR_BASIC ]
        mr_key_size: 4
        cq_data_size: 4
        cq_cnt: 65536
        ep_cnt: 32768
        tx_ctx_cnt: 1
        rx_ctx_cnt: 1
        max_ep_tx_ctx: 1
        max_ep_rx_ctx: 1
        max_ep_stx_ctx: 0
        max_ep_srx_ctx: 0
        cntr_cnt: 0
        mr_iov_limit: 1
    caps: [ FI_LOCAL_COMM, FI_REMOTE_COMM ]
    mode: [  ]
        auth_key_size: 0
        max_err_data: 0
        mr_cnt: 0
    fi_fabric_attr:
        name: IB-0x18338657682652659712
        prov_name: verbs;ofi_rxm
        prov_version: 1.0
        api_version: 1.5
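For anyone reproducing this, the dump above comes from exporting the SstVerbose environment variable before launch; a sketch, with placeholder launch commands and binary names:

```shell
#!/bin/sh
# Capture SST verbose logs from writer and reader separately (sketch).
export SstVerbose=5

# Placeholder launch commands for a real allocation:
# jsrun -n 10 ./writer 2> writer-sst.log &
# jsrun -n 10 ./reader 2> reader-sst.log

echo "SstVerbose=$SstVerbose"
```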
suchyta1 commented 2 years ago

I've submitted a job to the queue with SstVerbose=5 and I'll post the logs.

suchyta1 commented 2 years ago

wdmapp-logs.tar.gz

franzpoeschel commented 2 years ago

I think I have located the issue: my reading application somehow did not find the right MPI installation and instead started independent processes on each node. Each of these processes then requested all the data from all nodes, which was too much. So, not an ADIOS2 issue. I'll run a test and see if this fixes it.
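This failure mode can be spotted quickly before launching the real application: under a mismatched launcher, every process reports itself as rank 0 of a size-1 world. A sketch of such a pre-flight check (the MPI environment variables checked are those set by Open MPI and generic PMI launchers; the names are assumptions about the launcher in use):

```shell
#!/bin/sh
# Pre-flight sketch: detect a launcher that starts independent processes
# instead of one MPI job. Common launchers export a world-size variable;
# if none is set, each process is likely running standalone.
size="${OMPI_COMM_WORLD_SIZE:-${PMI_SIZE:-1}}"
if [ "$size" -gt 1 ]; then
    echo "MPI launch detected, world size $size"
else
    echo "WARNING: no MPI world size found; processes may be independent"
fi
```

Run under the intended `mpirun`/`jsrun`, each rank should report the same world size; seeing the warning on every process matches the symptom described above.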

eisenhauer commented 2 years ago

Interesting. Hope that solves it.

franzpoeschel commented 2 years ago

It solves it, partially. The issue with the reading application crashing after a few iterations seems to persist, but that might be on me. I'll look into it further and report back.

halehawk commented 1 year ago

Hi,

I am working on Slingshot 11 to check whether I can get RDMA set up for the ADIOS2 SST engine. There is no ib; I only have cxi (the Cray provider) or hsn (the TCP provider). Do you know how I should set the following environment variables, or is there anything else I need to set?

    export FI_PROVIDER=gni
    export FABRIC_IFACE=cxi0
    export FI_OFI_RXM_USE_SRX=1

Here is what I got from SstVerbose=5:

    Writer 0 (0x6eff300): Sst set to use sockets as a Control Transport
    Writer 0 (0x7360120): Sst set to use sockets as a Control Transport
    DP Writer 0 (0x6eff300): RDMA Dataplane could not find any viable fabrics.
    DP Writer 0 (0x6eff300): RDMA Dataplane could not find an RDMA-compatible fabric.
    DP Writer 0 (0x6eff300): RDMA Dataplane evaluating viability, returning priority -1
    DP Writer 0 (0x6eff300): Prefered dataplane name is "evpath"
    DP Writer 0 (0x6eff300): Considering DataPlane "evpath" for possible use, priority is 1
    DP Writer 0 (0x6eff300): Selecting DataPlane "evpath" (preferred) for use
    DP Writer 0 (0x6eff300): RDMA Dataplane unloading
    DP Writer 1 (0x7360120): RDMA Dataplane could not find any viable fabrics.
    DP Writer 1 (0x7360120): RDMA Dataplane could not find an RDMA-compatible fabric.
    DP Writer 1 (0x7360120): RDMA Dataplane evaluating viability, returning priority -1
    DP Writer 1 (0x7360120): RDMA Dataplane unloading
    Writer 1 (0x7360120): Stream "result_prim" waiting for 1 readers
    Writer 0 (0x6eff300): Opening Stream "result_prim"
    Writer 0 (0x6eff300): Writer stream params are:
    Param - RegistrationMethod=File
    Param - RendezvousReaderCount=1
    Param - QueueLimit=0 (unlimited)
    Param - QueueFullPolicy=Block
    Param - DataTransport=evpath
    Param - ControlTransport=sockets
    Param - NetworkInterface=(default)
    Param - ControlInterface=(default to NetworkInterface if applicable)
    Param - DataInterface=(default to NetworkInterface if applicable)
    Param - CompressionMethod=None
    Param - CPCommPattern=Min
    Param - MarshalMethod=BP
    Param - FirstTimestepPrecious=False
    Param - IsRowMajor=1 (not user settable)
    Param - OpenTimeoutSecs=60 (seconds)
    Param - SpeculativePreloadMode=Auto
    Param - SpecAutoNodeThreshold=1
    Param - ControlModule=select

eisenhauer commented 1 year ago

Hi @halehawk. The short answer is that the SST RDMA layer cannot yet use the CXI provider. We hope to fix this at some point, but in the meantime SST will only use the evpath data plane (which is TCP-based).
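In the meantime, the TCP data plane can be pinned explicitly via SST's DataTransport engine parameter, which ADIOS2 can read from a runtime XML config. A sketch that writes such a config (the io name "result_prim" matches the stream in the log above; the filename is arbitrary):

```shell
#!/bin/sh
# Sketch: pin SST to the evpath (TCP) data plane via an ADIOS2 XML config.
cat > adios2-sst.xml <<'EOF'
<adios-config>
  <io name="result_prim">
    <engine type="SST">
      <parameter key="DataTransport" value="evpath"/>
    </engine>
  </io>
</adios-config>
EOF

# Confirm the parameter landed in the file.
grep -c 'DataTransport' adios2-sst.xml
```

The application then passes the config file path when constructing its ADIOS object, and the engine parameter applies without recompiling.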