ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

How does shm handle inject pool exhaustion? #8976

Open · shijin-aws opened this issue 1 year ago

shijin-aws commented 1 year ago

I was debugging a segfault when running LAMMPS on 48 AWS hpc6a instances (96 ranks per node, 4608 ranks in total) with the libfabric main branch. The backtrace shows the error happens when shm tries to copy data from the user buffer to the bounce buffer (tx_buf->data, allocated from the inject buffer pool) in the peer smr region.

(gdb) bt
#0  0x00007f324119c053 in __memcpy_ssse3 () from /lib64/libc.so.6
#1  0x00007f32342394c4 in ofi_memcpy (device=0,
    dest=0xff0b59ebcfd27fd1, src=0x3778440, size=840)
    at ./include/ofi_hmem.h:233
#2  0x00007f323423966f in ofi_copy_from_hmem (iface=FI_HMEM_SYSTEM,
    device=0, dest=0xff0b59ebcfd27fd1, src=0x3778440, size=840)
    at ./include/ofi_hmem.h:358
#3  0x00007f3234239cee in ofi_copy_mr_iov (mr=0x7ffefb0677d0,
    iov=0x7ffefb067800, iov_count=1, offset=0,
    buf=0xff0b59ebcfd27fd1, size=3256, dir=0) at src/hmem.c:315
#4  0x00007f3234239d8b in ofi_copy_from_mr_iov (
    dest=0xff0b59ebcfd27fd1, size=4096, mr=0x7ffefb0677d0,
    iov=0x7ffefb067800, iov_count=1, iov_offset=0) at src/hmem.c:330
#5  0x00007f323434ee0b in smr_format_inject (cmd=0x7f31da70a060,
    mr=0x7ffefb0677d0, iov=0x7ffefb067800, count=1,
    smr=0x7f31da6fc000, tx_buf=0xff0b59ebcfd27fd1)
    at prov/shm/src/smr_ep.c:355
#6  0x00007f323434fd78 in smr_do_inject (ep=0x27bf090,
    peer_smr=0x7f31da6fc000, id=77, peer_id=13, op=1,
    tag=4294967279, data=3661, op_flags=16908288,
    desc=0x7ffefb0677d0, iov=0x7ffefb067800, iov_count=1,
    total_len=840, context=0x3eade68, cmd=0x7f31da70a060)
    at prov/shm/src/smr_ep.c:710
#7  0x00007f323434759d in smr_generic_sendmsg (ep=0x27bf090,
    iov=0x7ffefb067800, desc=0x7ffefb0677d0, iov_count=1, addr=77,
    tag=4294967279, data=3661, context=0x3eade68, op=1,
    op_flags=16908288) at prov/shm/src/smr_msg.c:329
#8  0x00007f3234348378 in smr_tsenddata (ep_fid=0x27bf090,
    buf=0x3778440, len=840, desc=0x0, data=3661, dest_addr=77

Reading the code, the tx_buf was obtained by smr_get_txbuf, which pops the tx_buf from a freestack backed by smr_inject_pool. smr_inject_pool points into the peer_smr region at inject_offset. As far as I can tell, there is no error handling in this code path when the inject pool is exhausted.
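For illustration, here is a toy model of such an unchecked pop (invented names, not the actual smr freestack code). Entries live in slot[1..POOL_SIZE]; slot[0] stands in for whatever memory happens to precede the pool. The last pop returns garbage and drives the counter negative, matching the bogus tx_buf pointer in the backtrace above:

    #include <stdio.h>
    #include <stdint.h>

    #define POOL_SIZE 4

    struct toy_freestack {
        intptr_t slot[POOL_SIZE + 1];
        int free;               /* entries still available */
    };

    static void *toy_pop(struct toy_freestack *fs)
    {
        /* No empty check: once free hits 0, this returns slot[0]
         * (garbage) and free goes negative. */
        return (void *) fs->slot[fs->free--];
    }

    int main(void)
    {
        struct toy_freestack fs = { .free = POOL_SIZE };

        fs.slot[0] = (intptr_t) 0xff0b59ebcfd27fd1;  /* "garbage" sentinel */
        for (int i = 1; i <= POOL_SIZE; i++)
            fs.slot[i] = (intptr_t) &fs.slot[i];     /* stand-in buffers */

        for (int i = 0; i <= POOL_SIZE; i++)
            printf("pop -> %p, free now %d\n", toy_pop(&fs), fs.free);
        return 0;
    }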

The segfault disappears if I increase the shm rx size from the default 1024 to 4096, so I guess the segfault may be due to inject pool exhaustion, which causes an invalid tx_buf to be used?

There was a PR https://github.com/ofiwg/libfabric/pull/8764 by @wzamazon to increase the rx size for a similar exhaustion issue with smr_cmd_ctx. It was later replaced by PR https://github.com/ofiwg/libfabric/pull/8766, which moves cmd_ctx from a freestack to a bufpool that can grow dynamically.

I am not sure whether we can apply the same strategy here, because the inject pool seems to be part of the bigger memory region mmapped from shm_open, and it is followed contiguously by the sar pool.

@aingerson What do you think?

aingerson commented 1 year ago

@shijin-aws The inject pool usage is controlled by the cmd queue. There is a 1:1 relationship with the cmds in the queue and inject pools (both are on the receiver side and are allocated to a size of rx_size) so if we are able to grab a cmd (which does have error checking) then it is assumed that we will be able to grab an inject buffer as well. It doesn't look like the inject buffer it grabbed is NULL, so the pop likely did return something, but not the correct thing. My initial thought is that popping the last element off the freestack may have a bug. I can take a look and see if anything jumps out. Thanks for finding this!

shijin-aws commented 1 year ago

The inject pool usage is controlled by the cmd queue. There is a 1:1 relationship with the cmds in the queue and inject pools (both are on the receiver side and are allocated to a size of rx_size) so if we are able to grab a cmd (which does have error checking) then it is assumed that we will be able to grab an inject buffer as well.

I see, so when the cmd queue is full, smr_cmd_queue_next will return -FI_ENOENT in https://github.com/ofiwg/libfabric/blob/main/prov/shm/src/smr_msg.c#L301-L303, which makes smr_generic_sendmsg return -FI_EAGAIN. This makes sense to me. So either something is up with the pop, or smr_cmd_queue_next returns FI_SUCCESS incorrectly? I am actively debugging this issue now; do you have any suggestions (e.g., things to print) I could try?
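For reference, the linked check is roughly this shape (paraphrased, not the verbatim source):

    ret = smr_cmd_queue_next(smr_cmd_queue(peer_smr), &ce, &pos);
    if (ret == -FI_ENOENT)
        return -FI_EAGAIN;  /* cmd queue full: sender backs off and retries */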

aingerson commented 1 year ago

There could be an issue in the unexpected path. We used to have a cmd_cnt to protect against this. This is how it used to go:

- Get a cmd, decrement cmd_cnt, allocate an inject buf.
- If the receiver finds a match: copy out of the buffer, push the inject buffer, discard the cmd, and increment cmd_cnt.
- If the receiver does not find a match: copy over the cmd information to process later, discard the cmd, but do NOT push the inject buffer or increment the count. When an rx buffer is posted and we process the inject buffer: copy the contents, push the inject buffer, and finally increment cmd_cnt.

We got rid of the cmd_cnt in the atomic queue switch, and I think that could be the problem. We might have to bring it back or restructure the architecture to allow this. I'm playing around with a rework of the cmd/hdr/buffer allocation that we could incorporate this into, but if you want a fix ASAP, you could bring back the cmd_cnt that protects when we can send. Or you could add error checking in the fs allocation.
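For concreteness, here is a rough, self-contained sketch of that cmd_cnt gating (names like peer_state and try_send_inject are invented for illustration; this is not the actual shm provider code):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct peer_state {
        atomic_int cmd_cnt;   /* cmds (and thus inject bufs) still available */
    };

    /* Sender: claim a slot before touching the cmd queue/inject pool. */
    static bool try_send_inject(struct peer_state *p)
    {
        int cnt = atomic_load(&p->cmd_cnt);

        do {
            if (cnt <= 0)
                return false;   /* caller returns -FI_EAGAIN */
        } while (!atomic_compare_exchange_weak(&p->cmd_cnt, &cnt, cnt - 1));
        /* ... safe to grab a cmd and pop an inject buf here ... */
        return true;
    }

    /* Receiver, expected path: buffer consumed right away. */
    static void recv_matched(struct peer_state *p)
    {
        /* ... copy out of the inject buf, push it back, discard cmd ... */
        atomic_fetch_add(&p->cmd_cnt, 1);
    }

    /* Receiver, unexpected path: the cmd is discarded but the inject buf
     * stays held, so the counter is only bumped once the rx buffer is
     * posted and the buffered data is consumed. */
    static void recv_unexpected_done(struct peer_state *p)
    {
        atomic_fetch_add(&p->cmd_cnt, 1);
    }

    int main(void)
    {
        struct peer_state p = { .cmd_cnt = 2 };

        while (try_send_inject(&p))
            ;                                       /* drain both slots */
        printf("send gated: %d\n", try_send_inject(&p));  /* 0 */
        recv_matched(&p);                           /* peer consumed one */
        printf("send ok:    %d\n", try_send_inject(&p));  /* 1 */
        recv_unexpected_done(&p);
        return 0;
    }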

shijin-aws commented 1 year ago

but if you want a fix ASAP, you could bring back the cmd_cnt that protects when we can send

I am trying my best to bring it back, but I am hitting a hang and am still debugging it.

Or you could add error checking in the fs allocation.

Do you mean the fs pop? E.g., check whether the tx_buf is beyond the boundary of the inject pool?

shijin-aws commented 1 year ago

I made a little progress. I was able to find that some tx_buf pops from the freestack drove fs->free negative before the segfault happens:

...
smr_get_txbuf: smr_inject_pool->free is negative: -1
smr_get_txbuf: smr_inject_pool->free is negative: -1
smr_get_txbuf: smr_inject_pool->free is negative: -961
[queue-hpc6a48xlarge-st-hpc6a48xlarge-10:87964] *** Process received signal ***
[queue-hpc6a48xlarge-st-hpc6a48xlarge-10:87964] Signal: Segmentation fault (11)
[queue-hpc6a48xlarge-st-hpc6a48xlarge-10:87964] Signal code:  (128)

shijin-aws commented 1 year ago

If the receiver does not find a match: copy over the cmd information to process later, discard the cmd, but do NOT push the inject buffer or increment the count. When an rx buffer is posted and we process the inject buffer: copy the contents, push the inject buffer, and finally increment cmd_cnt.

I see, this is definitely an issue. Because the push is delayed on the unexpected path, the inject pool can be drained while the cmd queue is not: each unexpected inject frees its cmd slot immediately but holds its inject buffer until a matching rx buffer is posted, so a sender can still grab a cmd and then pop from an empty inject pool.

shijin-aws commented 1 year ago

I had an experimental patch like this:

diff --git a/prov/shm/src/smr_ep.c b/prov/shm/src/smr_ep.c
index d4e9bdfc7..b7e5a162a 100644
--- a/prov/shm/src/smr_ep.c
+++ b/prov/shm/src/smr_ep.c
@@ -233,6 +233,9 @@ static void smr_send_name(struct smr_ep *ep, int64_t id)
        ret = smr_cmd_queue_next(smr_cmd_queue(peer_smr), &ce, &pos);
        if (ret == -FI_ENOENT)
                return;
+       if (smr_inject_pool(peer_smr)->free <= 0)
+               return;
+
        tx_buf = smr_get_txbuf(peer_smr);

        ce->cmd.msg.hdr.op = SMR_OP_MAX + ofi_ctrl_connreq;
@@ -704,6 +707,9 @@ static ssize_t smr_do_inject(struct smr_ep *ep, struct smr_region *peer_smr, int
 {
        struct smr_inject_buf *tx_buf;

+       if (smr_inject_pool(peer_smr)->free <= 0)
+               return -FI_EAGAIN;
+
        tx_buf = smr_get_txbuf(peer_smr);

        smr_generic_format(cmd, peer_id, op, tag, data, op_flags);

It returns -FI_EAGAIN when the inject pool is drained, but I got a hang. Open MPI polls the CQ and retries fi_tsend when it returns -FI_EAGAIN, so that means the tx_buf never got pushed back by the progress engine.
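For context, this is the retry loop a consumer like Open MPI effectively runs (a sketch using the public libfabric API, not Open MPI's actual code): reading the CQ is what drives the provider's progress engine, so if progress never pushes inject buffers back, the loop spins forever.

    #include <rdma/fabric.h>
    #include <rdma/fi_endpoint.h>
    #include <rdma/fi_eq.h>
    #include <rdma/fi_errno.h>
    #include <rdma/fi_tagged.h>

    static ssize_t tsend_with_retry(struct fid_ep *ep, struct fid_cq *cq,
                                    const void *buf, size_t len, void *desc,
                                    fi_addr_t dest, uint64_t tag, void *ctx)
    {
        ssize_t ret;

        do {
            ret = fi_tsend(ep, buf, len, desc, dest, tag, ctx);
            if (ret == -FI_EAGAIN)
                (void) fi_cq_read(cq, NULL, 0);   /* drive progress */
        } while (ret == -FI_EAGAIN);

        return ret;
    }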

aingerson commented 1 year ago

@shijin-aws Note that the check there is not under the lock, so it's racy. You would have to put the check in the call that gets the tx buf (and locks the region). We should also probably add an assert(free) in the pop path. Are you testing with or without the util srx code? I can get a cmd_cnt patch for you to see if that has different behavior.
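Something along these lines, as a self-contained sketch (toy types and names, not the actual smr region code): the emptiness check and the pop happen inside the same critical section, and a NULL return maps to -FI_EAGAIN in the caller.

    #include <pthread.h>
    #include <stddef.h>

    /* Toy stand-ins for the smr region and its inject freestack. */
    struct toy_freestack {
        int free;               /* entries still available */
        void *entry[1024];      /* rx_size-sized pool in the real provider */
    };

    struct toy_region {
        pthread_spinlock_t lock;
        struct toy_freestack inject_pool;
    };

    /* Check-and-pop under the region lock, so the emptiness test cannot
     * race against the peer pushing buffers back. */
    static void *toy_get_txbuf(struct toy_region *peer_smr)
    {
        void *tx_buf = NULL;

        pthread_spin_lock(&peer_smr->lock);
        if (peer_smr->inject_pool.free > 0) {
            peer_smr->inject_pool.free--;
            tx_buf = peer_smr->inject_pool.entry[peer_smr->inject_pool.free];
        }
        pthread_spin_unlock(&peer_smr->lock);

        return tx_buf;  /* NULL => caller returns -FI_EAGAIN */
    }

The point is only that the check and the pop must sit under the same lock the real getter already takes; checking free from outside, as in the experimental patch above, can race.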

shijin-aws commented 1 year ago

@aingerson Hi, I am using the main branch, i.e., no util srx. Using util srx has the same issue, though.

shijin-aws commented 1 year ago

You would have to put the check in the call that gets the tx buf (and locks the region).

New patch: https://github.com/shijin-aws/libfabric/commit/d1dae4d45369e9cee43cadef159df2170c406818

I added some prints https://github.com/shijin-aws/libfabric/commit/3bd4592898d7d56936b499a2766e611c9ac33e3d and I can see a lot of negative free values. So somehow this value is already corrupted outside smr_get_txbuf. I also hit a hang with this patch.

shijin-aws commented 1 year ago

I can get a cmd_cnt patch for you to see if that has different behavior

@aingerson That will be greatly appreciated!

aingerson commented 1 year ago

@shijin-aws Tried to add the cmd_cnt back but it's proving to be much more complicated since we took out the lock. Can you try this patch that adds error handling on the inject buffer allocation? Do you still see it hitting a negative free count? https://github.com/aingerson/libfabric/tree/shm

shijin-aws commented 1 year ago

@aingerson Thanks! I tried your patch. The negative free count disappears, but I am still hitting hangs. Somehow the tx_buf was not pushed back, or MPI was not retrying the send correctly. I am looking into it.

lrbison commented 1 year ago

@aingerson Thanks for your patch. Turns out @shijin-aws and I stumbled into this same problem from different applications. After reading this thread this morning I independently came up with nearly the same patch as yours: https://github.com/lrbison/libfabric/commit/843688014fc12169759e02cfcf65d9bae2c692aa

I actually found that my patch worked but yours did not. In the end, I was able to isolate the difference to some extra logging.

Sure enough, if I simply usleep(500) when I find the stack to be empty, the job runs to completion.

I am still trying to figure out precisely what the sleep is avoiding.

aingerson commented 1 year ago

@lrbison That's interesting. Where is your usleep? Are you sleeping before failing and then trying again within the send?

lrbison commented 1 year ago

Essentially, right here in smr_get_txbuf. No special retries other than the -FI_EAGAIN that your patch already does.

I should mention, I'm working on the 1.18.x branch for this, and your patch applies cleanly to it.
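Continuing the toy sketch from earlier in the thread (reusing its toy_region and toy_get_txbuf definitions), the experiment looks roughly like this; it is diagnostic only, not a fix:

    #include <unistd.h>

    /* Back off briefly when the pool looks empty, then re-check. */
    static void *toy_get_txbuf_backoff(struct toy_region *peer_smr)
    {
        void *tx_buf = toy_get_txbuf(peer_smr);

        if (!tx_buf) {
            usleep(500);        /* give the peer time to push buffers back */
            tx_buf = toy_get_txbuf(peer_smr);
        }
        return tx_buf;          /* may still be NULL => -FI_EAGAIN */
    }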

lrbison commented 1 year ago

@aingerson It seems that without the sleep, we must remove this line from smr_progress, so that we always run progress regardless of ep->region->signal.

aingerson commented 1 year ago

@lrbison Hmm, that likely means we aren't signaling properly somewhere. Can you try adding a signal on the EAGAIN case in the message path? I notice that the rma path has something similar:

    ret = smr_cmd_queue_next(smr_cmd_queue(peer_smr), &ce, &pos);
    if (ret == -FI_ENOENT) {
        /* kick the peer to process any outstanding commands */
        ret = -FI_EAGAIN;
        goto signal;
    }

I'm trying to think of why that would be needed. It is likely compensating for a missing signal elsewhere.

shijin-aws commented 1 year ago

Removing that line fixed the hang on my end as well. Luke and I guess we would have to do smr_signal(smr) before returning -FI_EAGAIN; otherwise the inject buffers will never be pushed back.

lrbison commented 1 year ago

Yes. I was trying that just now. It seems to me that smr_get_txbuf should signal once it recognizes the freestack is empty.
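Building on the toy sketch above, the shape of that fix would be something like this (toy_signal stands in for smr_signal; this is illustrative, not the actual patch):

    #include <errno.h>

    /* Stand-in for smr_signal(): wake the peer's progress engine. */
    static void toy_signal(struct toy_region *peer_smr) { (void) peer_smr; }

    static int toy_do_inject(struct toy_region *peer_smr)
    {
        void *tx_buf = toy_get_txbuf(peer_smr);

        if (!tx_buf) {
            /* Without this kick the peer may never run progress, never
             * push inject buffers back, and the sender's EAGAIN/retry
             * loop hangs, which is the behavior seen above. */
            toy_signal(peer_smr);
            return -EAGAIN;     /* stands in for -FI_EAGAIN */
        }
        /* ... format the cmd and copy the payload into tx_buf ... */
        return 0;
    }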

lrbison commented 1 year ago

Confirmed that, as you suspected, signaling on the EAGAIN case fixes the hang. I checked before signaling, and indeed there are times when the signal flag was false but the freestack was empty.

I am preparing a PR to do as you say, based on your initial patch.

aingerson commented 1 year ago

Thanks for testing it! We can definitely patch that up now, but as I said, I think that signal is compensating for a missing signal somewhere else. When a process pops a tx buf and sends it to a peer, there should always be a signal that tells the peer to process that message. If we rely on the next message to do that signaling, that's problematic.

lrbison commented 1 year ago

@aingerson I wonder if you could review the PR I put up to handle this issue. We are seeing some customer workloads impacted in our latest software based on the 1.18.x branch, and we would like to cut a new release after these changes are backported to 1.18.x.

Much appreciated!

lrbison commented 1 year ago

And yes, I agree the root cause seems to be that there must be a path where signal was not getting called. I think my changes make it more likely that we catch those paths as well, but I agree it's not a smoking gun for the real root cause.

aingerson commented 1 year ago

Absolutely, let's definitely provide what we can for a fix but also put a pin in this to identify the actual issue as well. Thank you!

j-xiong commented 4 months ago

Labeling it as an enhancement since a fix is already in place but more investigation is needed to find the root cause.

aingerson commented 4 months ago

@j-xiong I believe this was resolved with the removal of the signal + unexpected inject buffering. @shijin-aws could you confirm?