mercury-hpc / mercury

Mercury is a C library for implementing RPC, optimized for HPC.
http://www.mcs.anl.gov/projects/mercury/
BSD 3-Clause "New" or "Revised" License

NA OFI: support vector rma operations #346

Open carns opened 4 years ago

carns commented 4 years ago

Is your feature request related to a problem? Please describe.

The na_ofi.c code already uses libfabric API calls that support vectors of local and remote buffers (fi_writemsg() and fi_readmsg()), but it hard-codes the vector lengths to 1 (https://github.com/mercury-hpc/mercury/blob/06bf87e4403dd1e487b53204569e05296e6a5061/src/na/na_ofi.c#L4171).

Describe the solution you'd like

It would be better if it translated vectors from the bulk handle into native vectors for libfabric, as is done in the na_sm.c code here: https://github.com/mercury-hpc/mercury/blob/06bf87e4403dd1e487b53204569e05296e6a5061/src/na/na_sm.c#L3733

We need to confirm whether the relevant libfabric providers actually support this capability; if not, we will need to retain a fallback path for providers that don't (and add that to the flags checked in NA_OFI_PROV_TYPES so we can quickly toggle it if offending providers gain support later).
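As a rough sketch of what that translation could look like (not the actual na_ofi.c or na_sm.c code; the segment arrays, descriptors, and function name below are hypothetical), a single fi_writemsg() call can carry the whole vector once iov_count and rma_iov_count are allowed to exceed 1:

#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

/* Post one RMA write that covers every segment of a (hypothetical) bulk
 * handle instead of hard-coding iov_count/rma_iov_count to 1. */
static ssize_t
post_vector_write(struct fid_ep *ep, fi_addr_t dest_addr,
    const struct iovec *local_iov, void **local_desc,
    const struct fi_rma_iov *remote_iov, size_t seg_count, void *context)
{
    struct fi_msg_rma msg = {
        .msg_iov       = local_iov,   /* one entry per local segment  */
        .desc          = local_desc,  /* registration descriptors     */
        .iov_count     = seg_count,   /* > 1 when the provider allows */
        .addr          = dest_addr,
        .rma_iov       = remote_iov,  /* one entry per remote segment */
        .rma_iov_count = seg_count,
        .context       = context,
        .data          = 0
    };

    return fi_writemsg(ep, &msg, FI_COMPLETION);
}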

Additional context

From a quick scan of libfabric it looks like gni and psm2 likely support this. Verbs may not, but we need to confirm. There are numerous FI_E2BIG return codes in the verbs provider, but I'm not sure which code paths are relevant here when using rxm.

carns commented 4 years ago

For some more context, here is code in MPICH that issues vector libfabric operations when needed:

https://github.com/pmodels/mpich/blob/master/src/mpid/ch4/netmod/ofi/ofi_rma.h#L514

I don't see guards around the provider type there, though it's possible that I'm missing something.
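For what it's worth, a provider-agnostic guard would presumably look less like a check on the provider type and more like clamping against the limits the selected provider reports; a hypothetical sketch (names made up):

#include <rdma/fabric.h>

/* Hypothetical guard: rather than special-casing provider names, clamp the
 * number of segments per RMA operation to what the selected provider
 * reports, falling back to 1 (the current behavior) when needed. */
static size_t
clamp_rma_iov_count(const struct fi_info *info, size_t wanted)
{
    size_t limit = info->tx_attr->rma_iov_limit;

    if (limit == 0)
        limit = 1;

    return wanted < limit ? wanted : limit;
}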

soumagne commented 4 years ago

Yes, I had looked at that in the past, but it was not supported; see https://github.com/ofiwg/libfabric/issues/3062. We should have a look at it again and maybe add a query call to fall back for providers that do not support it.

carns commented 4 years ago

I didn't think to check the memory registration path; I just looked at what some of the codes are doing on the actual read/write RMA path. I guess we'll be testing both with this change :)

carns commented 4 years ago

MPICH appears to be setting msg.desc to NULL:

https://github.com/pmodels/mpich/blob/master/src/mpid/ch4/netmod/ofi/ofi_rma.h#L480

I don't know the implications, just pointing that out.

liuxuezhao commented 4 years ago

I considered this a long time ago. Vector RMA should be supported by OFI, but providers set limits on the local iov_count and remote rma_iov_count (fi_info::tx_attr::iov_limit / ::rma_iov_limit), and different providers can have different limits, which can be queried with fi_getinfo(). Just FYI.
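A minimal sketch of that query (the hints, API version, and function name here are arbitrary placeholders, not what na_ofi.c actually passes):

#include <stdio.h>
#include <rdma/fabric.h>

/* Dump the per-provider IOV limits relevant to vector RMA. */
static void
print_iov_limits(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL, *cur;

    hints->caps = FI_RMA;
    if (fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info) == 0) {
        for (cur = info; cur != NULL; cur = cur->next)
            printf("%s: iov_limit=%zu rma_iov_limit=%zu mr_iov_limit=%zu\n",
                cur->fabric_attr->prov_name,
                cur->tx_attr->iov_limit,
                cur->tx_attr->rma_iov_limit,
                cur->domain_attr->mr_iov_limit);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
}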

soumagne commented 4 years ago

@carns All providers set mr_iov_limit and rma_iov_limit to 1. I am still doing some more testing, and we'll support it internally now, but I would keep hopes very low at this point :\

soumagne commented 4 years ago

@carns I was looking in more detail at the mpich code that you sent. I think it's still limited, though, since they also always set iov_count and rma_iov_count to 1.

carns commented 4 years ago

Ah, Ok. Thanks for the updates.

soumagne commented 4 years ago

Supporting code was added as part of https://github.com/mercury-hpc/mercury/commit/a6bbce8514011e21e5ef1b37283a17606a0d39cb but there are issues remaining with libfabric that will need to be further investigated.

roblatham00 commented 3 years ago

What are the "issues remaining with libfabric"? Phil pointed me to this post https://lists.openfabrics.org/pipermail/libfabric-users/2021-June/000861.html documenting the FI_VERBS_TX_IOV_LIMIT and FI_VERBS_RX_IOV_LIMIT environment variables, and the ibv_devinfo -v utility. On Summit, it looks like we should set those environment variables to 30.

But they had no impact on the vectored transfer benchmark. Before:

<op>    <warmup_seconds>        <concurrency>   <threads>       <xfer_size>     <vector_count>  <total_bytes>   <seconds>       <MiB/s> <xfer_memory>   <align_buffer>
PULL    1       1       0       524288  1       209230757888    20.000072       9976.864320     0
PULL    1       1       0       524288  2       209364451328    20.000095       9983.227515     0
PULL    1       1       0       524288  4       210315509760    20.000066       10028.592124    0
PULL    1       1       0       524288  8       210331238400    20.000078       10029.335904    0
PULL    1       1       0       524288  16      200076689408    20.000072       9540.365654     0
PULL    1       1       0       524288  32      192866156544    20.000097       9196.530270     0
PULL    1       1       0       524288  64      107745378304    20.000236       5137.639306     0
PULL    1       1       0       524288  128     53734277120     20.000313       2562.209957     0
PULL    1       1       0       524288  256     27163885568     20.000662       1295.232153     0

After:

PULL    1       1       0       524288  2       203238670336    20.000096       9691.128674     0
PULL    1       1       0       524288  4       202780442624    20.000068       9669.291918     0
PULL    1       1       0       524288  8       202621583360    20.000085       9661.708882     0
PULL    1       1       0       524288  16      193659404288    20.000131       9234.339565     0
PULL    1       1       0       524288  32      185765724160    20.000126       8857.944351     0
PULL    1       1       0       524288  64      112774348800    20.000202       5377.445768     0
PULL    1       1       0       524288  128     56699125760     20.000352       2703.577398     0
PULL    1       1       0       524288  256     28120711168     20.000367       1340.875384     0

soumagne commented 3 years ago

I would need to look at it again, but I had added support for it; you can turn it on by uncommenting this line: https://github.com/mercury-hpc/mercury/blob/master/src/na/na_ofi.c#L196 If I recall correctly, there were some issues with OFI not transferring things correctly. I don't think I have tried those environment variables, though.

roblatham00 commented 3 years ago

No change after uncommenting that define:

PULL    1       1       0       524288  1       188446408704    20.000071       8985.793186     0
PULL    1       1       0       524288  2       189848879104    20.000109       9052.650574     0
PULL    1       1       0       524288  4       192025198592    20.000094       9156.432103     0
PULL    1       1       0       524288  8       192173572096    20.000107       9163.500843     0
PULL    1       1       0       524288  16      183590453248    20.000067       8754.245780     0
PULL    1       1       0       524288  32      178335514624    20.000155       8503.634210     0
PULL    1       1       0       524288  64      108015386624    20.000248       5150.511207     0
PULL    1       1       0       524288  128     54678519808     20.000225       2607.245722     0
PULL    1       1       0       524288  256     26795311104     20.000391       1277.675006     0

roblatham00 commented 3 years ago

Here's a graph of the three cases, just to make it more clear:

[graph: margo-summit-vector-bw]

roblatham00 commented 3 years ago

A-ha! I stuck some debugging into hg_bulk_create; it tells me:
==== bulk: count: 64 max_segments: 1

roblatham00 commented 3 years ago

fi_info -v gives me this:

        domain: 0x0
        name: mlx5_1
        threading: FI_THREAD_SAFE
        control_progress: FI_PROGRESS_AUTO
        data_progress: FI_PROGRESS_AUTO
        resource_mgmt: FI_RM_ENABLED
        av_type: FI_AV_UNSPEC
        mr_mode: [ FI_MR_LOCAL, FI_MR_VIRT_ADDR, FI_MR_ALLOCATED, FI_MR_PROV_KEY ]
        mr_key_size: 4
        cq_data_size: 0
        cq_cnt: 16777216
        ep_cnt: 262144
        tx_ctx_cnt: 1024
        rx_ctx_cnt: 1024
        max_ep_tx_ctx: 1024
        max_ep_rx_ctx: 1024
        max_ep_stx_ctx: 0
        max_ep_srx_ctx: 8388608
        cntr_cnt: 0
        mr_iov_limit: 1

That mr_iov_limit is what I need to adjust; mercury consults that field:

priv->iov_max = priv->domain->fi_prov->domain_attr->mr_iov_limit;

Setting the env variable changes the iov_limit, but the mr_iov_limit is hard-coded to 1 (see #define VERBS_MR_IOV_LIMIT in fi_verbs.h).

Dang, I really thought I had something: I changed that define in libfabric, recompiled, and tried again, but mercury still tells me max_segments is 1. fi_info tells me mr_iov_limit is now 30... so why does mercury insist it is 1?

soumagne commented 3 years ago

Ah, OK, from what I see it looks like it's because of the rxm provider. When you're running fi_info, are you specifying verbs;ofi_rxm, or only verbs?

roblatham00 commented 3 years ago

I'm not calling the fi_info() routine. I'm calling the fi_info utility, which I think by default dumps all the providers it can find?

The rxm angle is interesting, but I don't see "1" in there. I do see

prov/rxm/src/rxm.h:#define RXM_IOV_LIMIT 4

so I'd expect a "sticky" 4, but I don't know where the 1 comes from.

roblatham00 commented 3 years ago

I figured "what the heck" and hard-coded 30 in mercury's na_ofi_initialize:

diff --git a/src/na/na_ofi.c b/src/na/na_ofi.c
index f3e6e47d..1152f26b 100644
--- a/src/na/na_ofi.c
+++ b/src/na/na_ofi.c
@@ -3928,7 +3928,9 @@ na_ofi_initialize(na_class_t *na_class, const struct na_info *na_info,
 #endif

     /* Cache IOV max */
-    priv->iov_max = priv->domain->fi_prov->domain_attr->mr_iov_limit;
+    //XXX: giant hack.  set to 30 and see what happens
+    //priv->iov_max = priv->domain->fi_prov->domain_attr->mr_iov_limit;
+    priv->iov_max = 30;

     /* Create endpoint */
     ret = na_ofi_endpoint_open(priv->domain, node_ptr, src_addr, src_addrlen,

Now vector lengths of 1 are OK, and 32 and higher are OK, but vector lengths of 2-16 give me HG_INVALID_ARG.

soumagne commented 3 years ago

@roblatham00 I don't think I was implying that you were calling the fi_info() routine :) There are options that you can pass to the fi_info utility, and one of them is -p, which lets you specify which provider you're querying. I was suggesting that you pass verbs\;ofi_rxm, since otherwise, yes, it would just dump everything.

You're also confusing the MR IOV limit and the regular IOV limit, I think. The MR IOV limit is the maximum number of buffers that can be passed to memory registration calls at once, i.e., the maximum number of buffers that may be associated with a single memory region. This is basically what happens when you call HG_Bulk_create() with N buffers: we register those N buffers, if possible, so that they appear as a single virtual region. The regular IOV limit, on the other hand, defines the maximum number of buffers that can be passed at once to fi_writev() or fi_readv() operations.

We currently do not have a mode in mercury that allows passing multiple buffers at once when doing a bulk transfer if they are not part of the same virtual region. I believe I would need to add another query call and possibly additional NA callbacks (the current NA put/get take only a single MR region) and pass an array of buffers and an array of memory regions. Let me look into this.
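To make the distinction concrete, here is a hedged sketch of the HG_Bulk_create() case described above (buffer names and sizes are made up): N buffers going into one handle, which is why the mr_iov_limit is what matters for it:

#include <mercury.h>
#include <mercury_bulk.h>

/* Register three separate buffers as a single bulk handle / virtual
 * region; whether this maps to one native registration is governed by the
 * provider's mr_iov_limit, not by the fi_writev()/fi_readv() iov_limit. */
static hg_return_t
create_three_segment_bulk(hg_class_t *hg_class, hg_bulk_t *handle)
{
    static char seg0[4096], seg1[4096], seg2[4096];
    void *buf_ptrs[3] = { seg0, seg1, seg2 };
    hg_size_t buf_sizes[3] = { sizeof(seg0), sizeof(seg1), sizeof(seg2) };

    return HG_Bulk_create(hg_class, 3, buf_ptrs, buf_sizes,
        HG_BULK_READWRITE, handle);
}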

roblatham00 commented 3 years ago

Thanks for the libfabric information. You are 100% correct that I am confused by libfabric domains, providers, etc.

Let me be more concrete:

I'm using Phil's benchmark to measure this: https://github.com/mochi-hpc-experiments/mochi-tests/blob/main/perf-regression/margo-p2p-vector.c#L243

margo_bulk_create() immediately calls HG_Bulk_create(). This is a single virtual region, right? We are just pointing to different parts of it?

carns commented 3 years ago

Right, that code is creating a single registration to represent a vector of memory regions. That entire registration (as a whole, including all vectors) is then transmitted with a bulk transfer operation.

Going by @soumagne's explanation, that means this benchmark relies on the mr_iov_limit rather than the iov_limit (thanks for the explanation on that; I was wondering what the exact difference was).

Mercury doesn't have an API to pick out pieces within a contiguous memory registration, or for that matter to construct a vector from different registrations. That could be a neat capability, though. MPICH, for example, uses this to stitch pre-registered header structs onto data payloads. You could also imagine it being used for two-phase-style aggregation, where one party doesn't know anything except the aggregate size, while the other parties fill in specific portions using vector operations. Right now we couldn't use vectors for that.
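As a hedged illustration of the "whole registration at once" behavior (handle and address arguments are placeholders, not the benchmark's actual variables), the entire registration travels in one HG_Bulk_transfer() call, and how many native RMA operations or iovec entries that becomes is decided inside NA:

#include <mercury.h>
#include <mercury_bulk.h>

/* Pull every segment covered by origin_handle into local_handle with a
 * single transfer call; the NA plugin decides how that maps onto RMA
 * operations underneath. */
static hg_return_t
pull_whole_registration(hg_context_t *context, hg_cb_t done_cb, void *arg,
    hg_addr_t origin_addr, hg_bulk_t origin_handle, hg_bulk_t local_handle,
    hg_size_t total_size, hg_op_id_t *op_id)
{
    return HG_Bulk_transfer(context, done_cb, arg, HG_BULK_PULL,
        origin_addr, origin_handle, 0, local_handle, 0, total_size, op_id);
}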

soumagne commented 3 years ago

I think that, at the very least, the benchmark could still be relevant if the virtual selection made internally results in several segments being transferred at once (internally registered as separate regions). The first step is to enable that and add support for it within the NA layer; then we can see about adding new vector-type APIs to HG Bulk itself.