ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

rdm_tagged_peek test fails randomly #8844

Closed shijin-aws closed 1 year ago

shijin-aws commented 1 year ago

We have observed that the recent change in rdm_tagged_peek (switching from fi_cq_sread to fi_cq_read by default) makes the sockets provider fail this test randomly.

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.15.88 'timeout 360 /bin/bash --login -c '"'"'/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr8841-debug/install/fabtests/bin/fi_rdm_tagged_peek -p sockets -s 172.31.15.88'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.0.74 'timeout 360 /bin/bash --login -c '"'"'/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr8841-debug/install/fabtests/bin/fi_rdm_tagged_peek -p sockets -s 172.31.0.74 172.31.15.88'"'"''
client_stdout:

client returncode: None
Client error: Client timed out
server_stdout:
[error] fabtests:common/shared.c:2310: Tag mismatch!. Expected: 2748, actual: 1
fi_cq_read/fi_cq_sread(): functional/rdm_tagged_peek.c:107, ret=-256 (Unspecified error)
Receive sync(): functional/rdm_tagged_peek.c:334, ret=-256 (Unspecified error)
Sending 10 tagged messages
Waiting for messages to complete

server returncode: 1

I don't think it's a test issue, as tcp, shm, and efa all pass this test consistently. I know the sockets provider is not maintained very actively, but we have customers in AWS that use the sockets provider for some basic functionality, so we still keep it tested in our CI.

Is the community OK with skipping this test for the sockets provider by modifying sockets.exclude?

shijin-aws commented 1 year ago

Or maybe we could make sockets run this test with -c sread to see whether that makes it stable, but I am not sure which config file I should modify for this local change.

shefty commented 1 year ago

I'd support disabling the test.

shijin-aws commented 1 year ago

Thanks https://github.com/ofiwg/libfabric/pull/8845

shijin-aws commented 1 year ago

Re-opening this issue as we have seen it fail for tcp as well. There could be some issue in the test.

shijin-aws commented 1 year ago

I have not been able to reproduce this issue on my machine so far. I have run this test 1000 times for both tcp and sockets and did not get any failure.

Looking at the error message

[error] fabtests:common/shared.c:2310: Tag mismatch!. Expected: 2748, actual: 1

2748 is 0xabc, so it is failing here https://github.com/ofiwg/libfabric/blob/main/fabtests/functional/rdm_tagged_peek.c#L332
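
A quick standalone check of that conversion (illustrative only, not part of the test):

#include <stdio.h>

int main(void)
{
        /* 0xabc == 2748, matching "Expected: 2748" in the error log above. */
        printf("%d 0x%x\n", 0xabc, 2748);  /* prints: 2748 0xabc */
        return 0;
}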

shijin-aws commented 1 year ago

The message received with tag 1 should be from the fi_tsendmsg in ft_ep_finalize(), which sends a tmsg with the tag set to tx_seq, and tx_seq is 1 at this point. But this tmsg is posted after the earlier fi_tsend with tag 0xabc, so somehow fi_cq_read gets a completion with tag 1 first.

shefty commented 1 year ago

Is the test using separate tx and rx cq's?

shijin-aws commented 1 year ago

Is the test using separate tx and rx cq's?

Yes. And AFAICT, before the recent change, this test did not do any tag validation after fi_cq_sread.

shefty commented 1 year ago

A send completion doesn't set the tag value. It's only set for a receive completion. I'm reading through the code, and I don't see what the issue could be. It definitely comes across as having a subtle race somewhere, but I don't see where.
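
For reference, a minimal sketch (assumed names, not the actual fabtests helper) of the kind of tag validation the test now performs after fi_cq_read(): spin on the receive CQ, then compare the completion's tag against the expected value. Returning -FI_EOTHER (256) would match the "ret=-256 (Unspecified error)" in the server log above.

#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static int check_rx_tag(struct fid_cq *rxcq, uint64_t expected_tag)
{
        struct fi_cq_tagged_entry comp;
        ssize_t ret;

        do {
                ret = fi_cq_read(rxcq, &comp, 1);
        } while (ret == -FI_EAGAIN);

        if (ret < 0)
                return (int) ret;

        /* The tag field is reported for receive completions; send completions
         * do not carry a meaningful tag. */
        if (comp.tag != expected_tag)
                return -FI_EOTHER;

        return 0;
}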

shijin-aws commented 1 year ago

Should we add hints->rx_attr->msg_order = FI_ORDER_SAS; in the test?
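
For concreteness, a minimal sketch (assumed fabtests-style hints setup, not the actual test code) of what requesting that ordering would look like:

#include <rdma/fabric.h>

static struct fi_info *alloc_sas_hints(void)
{
        struct fi_info *hints = fi_allocinfo();

        if (!hints)
                return NULL;

        hints->ep_attr->type = FI_EP_RDM;
        hints->caps = FI_TAGGED;
        /* Ask the provider to match receives in the order the sends were issued. */
        hints->rx_attr->msg_order = FI_ORDER_SAS;
        return hints;
}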

aingerson commented 1 year ago

I ran this through the CI 100 times and reproduced the error, then ran it through the CI 100 times with a forced consumption of the preposted buffer and did not see the error. It's pretty random, so I'm not positive that's for sure a fix, but there could be a problem with rx_seq since there is an extra receive buffer that has not been consumed.

shefty commented 1 year ago

No, the test is checking for receives being handled out of order.

shijin-aws commented 1 year ago

Should we add hints->rx_attr->msg_order = FI_ORDER_SAS; in the test?

EFA enforces SAS in rx_attr by default. Do tcp/sockets do the same?

shefty commented 1 year ago

No, but they're layered over streams, which is SAS ordering. SAS doesn't guarantee that completions at the receiver must be written in that order. But, I think this might be it:

The fi_tsend() at line 317 can complete prior to the server calling trecv_op() on line 332. That means the client can proceed to ft_finalize() and post the next message. Depending on how slow the server is, that finalize message can find a match prior to trecv_op() being posted. Once trecv_op() is posted, the completion that we retrieve may actually be the finalize message completing.

aingerson commented 1 year ago

@shefty I think you're right. Adding a forced sleep before the recv on 332 reproduces the error.

shijin-aws commented 1 year ago

Depending on how slow the server is, that finalize message can find a match prior to trecv_op() being posted. Once trecv_op() is posted, the completion that we retrieve may actually be the finalize message completing.

Thanks, I think that makes sense. There is a trecv with tag 1 posted earlier in ft_init_av_dst_addr.

shijin-aws commented 1 year ago

@shefty I think you're right. Adding a forced sleep before the recv on 332 reproduces the error.

Interestingly, I still see the test pass after adding that sleep:

@@ -329,6 +330,7 @@ static int run(void)
                if (ret)
                        return ret;

+               sleep(10);
                ret = trecv_op(0xabc, 0, false);
                if (ret) {
                        FT_PRINTERR("Receive sync", ret);
aingerson commented 1 year ago

Weird. I can get it failing consistently with a sleep(1) exactly there, with both debug and regular builds.

shefty commented 1 year ago

I'm wondering if we even need that 0xabc message exchange. And if we eliminate it, would that push the failure up to the previous message? My guesses are no and yes.

shijin-aws commented 1 year ago

Weird. I can get it failing consistently with a sleep(1) exactly there, with both debug and regular builds.

I tested the other providers: tcp still passes, but sockets fails consistently.

shefty commented 1 year ago

I'm thinking (dangerous habit) we can replace ft_finalize() with an OOB sync.
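
A rough sketch of what such an out-of-band sync could look like (hypothetical helper over an already-connected TCP control socket; fabtests has its own OOB plumbing, this only illustrates the idea): each side writes one byte and then waits to read one byte, so neither side returns until both have reached the sync point, and nothing is posted on the data endpoint.

#include <unistd.h>

static int oob_sync(int sock_fd)
{
        char tx = 'S', rx = 0;

        /* Symmetric one-byte exchange: send our token, then block for the
         * peer's token. */
        if (write(sock_fd, &tx, 1) != 1)
                return -1;
        if (read(sock_fd, &rx, 1) != 1)
                return -1;
        return 0;
}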

shijin-aws commented 1 year ago

I'm wondering if we even need that 0xabc message exchange. And if we eliminate it, would that push the failure up to the previous message? My guesses are no and yes.

I agree; if we eliminate this exchange, the last recv in do_recvs() could hit this mismatch as well.

shijin-aws commented 1 year ago

I'm thinking (dangerous habit) we can replace ft_finalize() with an OOB sync.

If we only do the finalize OOB, there will be an outstanding rx during endpoint cleanup, which is fine, but there will be a warning in efa's log.

shefty commented 1 year ago

I don't think that matters. There are usually outstanding messages in the non-tagged case.

shijin-aws commented 1 year ago

Can't we remove ft_finalize() but keep the 0xabc exchange?

shijin-aws commented 1 year ago

Then the client will only send a single 0xabc message, so we should not hit this issue.

aingerson commented 1 year ago

@shefty @shijin-aws How does this look? 57e7ebf148b7b104ea60fa18ff0a57030a5e61d4

shijin-aws commented 1 year ago

Looks good to me.

shefty commented 1 year ago

The 0xabc message doesn't have the right completion semantics, so simply dropping ft_finalize() may still be racy. @aingerson - I think that works.

shefty commented 1 year ago

If we changed the tsend 0xabc to delivery_complete, I think it would work. (But then, I thought the current code would work too.)
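
A sketch of what that could look like (assumed names, not the actual fabtests change), using fi_tsendmsg() so the FI_DELIVERY_COMPLETE flag can be passed per operation; the sender's completion is then not generated until the target has processed the message, which closes the window where the client runs ahead into ft_finalize() before the server has matched the 0xabc receive:

#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_tagged.h>

static ssize_t tsend_delivery_complete(struct fid_ep *ep, fi_addr_t dest,
                                       void *buf, size_t len, void *desc,
                                       uint64_t tag, void *ctx)
{
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct fi_msg_tagged msg = {
                .msg_iov   = &iov,
                .desc      = &desc,
                .iov_count = 1,
                .addr      = dest,
                .tag       = tag,
                .ignore    = 0,
                .context   = ctx,
                .data      = 0,
        };

        /* FI_DELIVERY_COMPLETE: delay the send completion until the message
         * has been processed at the target. */
        return fi_tsendmsg(ep, &msg, FI_DELIVERY_COMPLETE);
}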

shijin-aws commented 1 year ago

Fixed by https://github.com/ofiwg/libfabric/pull/8859

shijin-aws commented 1 year ago

The PR is merged, closing for now.