Closed — shijin-aws closed this issue 1 year ago
Or maybe we could make sockets run this test with `-c sread` to see if that makes it stable, but I am not sure which config file I should modify for this local change.
I'd support disabling the test.
Re-opening this issue as we have seen it fail for tcp as well. There could be some issue in the test.
I have not been able to reproduce this issue on my machine so far. I ran this test 1000 times for both tcp and sockets and did not get any failure.
Looking at the error message

```
[error] fabtests:common/shared.c:2310: Tag mismatch!. Expected: 2748, actual: 1
```

2748 is 0xabc, so it is failing here: https://github.com/ofiwg/libfabric/blob/main/fabtests/functional/rdm_tagged_peek.c#L332
The message received with tag 1 should be from the fi_tsendmsg in ft_ep_finalize(), which sends a tagged message with its tag set to tx_seq, and tx_seq is 1 at this moment. But that message is posted after the earlier fi_tsend with tag 0xabc, so somehow fi_cq_read gets a completion with tag 1 first.
Is the test using separate tx and rx cq's?
> Is the test using separate tx and rx cq's?
Yes. And AFAICT, before the recent change, this test did not do any tag validation after fi_cq_sread.
A send completion doesn't set the tag value. It's only set for a receive completion. I'm reading through the code, and I don't see what the issue could be. It definitely comes across as having a subtle race somewhere, but I don't see where.
Should we add `hints->rx_attr->msg_order = FI_ORDER_SAS;` in the test?
I ran this through the CI 100 times and reproduced the error; then I ran it through the CI another 100 times with a forced consumption of the preposted buffer and did not see the error. It's pretty random, so I'm not positive that's a definite fix, but there could be a problem with rx_seq, since there is an extra receive buffer that has not been consumed.
No, the test is checking for receives being handled out of order.
> Should we add `hints->rx_attr->msg_order = FI_ORDER_SAS;` in the test?
EFA enforces SAS in rx_attr by default. Do tcp/sockets do the same?
No, but they're layered over streams, which gives SAS ordering. SAS doesn't guarantee that completions at the receiver are written in that order. But I think this might be it:
The fi_tsend() at line 317 can complete prior to the server calling trecv_op() on line 332. That means the client can proceed to ft_finalize() and post the next message. Depending on how slow the server is, that finalize message can find a match before trecv_op() is posted. Once trecv_op() is posted, the completion that we retrieve may actually be the finalize message completing.
@shefty I think you're right. Adding a forced sleep before the recv on line 332 reproduces the error.
> Depending on how slow the server is, that finalize message can find a match prior to trecv_op() being posted. Once trecv_op() is posted, the completion that we retrieve may actually be the finalize message completing.
Thanks, I think it makes sense. There is a trecv with tag 1 posted earlier in ft_init_av_dst_addr
> @shefty I think you're right. Adding a forced sleep before the recv on 332 reproduces the error.
Interestingly, the test still passes after I add that sleep:
```diff
@@ -329,6 +330,7 @@ static int run(void)
 	if (ret)
 		return ret;
+	sleep(10);
 	ret = trecv_op(0xabc, 0, false);
 	if (ret) {
 		FT_PRINTERR("Receive sync", ret);
```
Weird. I can get it to fail consistently with a sleep(1) exactly there, with both debug and regular builds.
I'm wondering if we even need that 0xabc message exchange. And if we eliminate it, would that push the failure up to the previous message? My guesses are no and yes.
> weird. I can get it consistently failing with a sleep(1) exactly there with both debug and regular builds.
I tested the other providers: tcp still passes, but sockets fails consistently.
I'm thinking (dangerous habit) we can replace ft_finalize() with an OOB sync.
> I'm wondering if we even need that 0xabc message exchange. And if we eliminate it, would that push the failure up to the previous message? My guesses are no and yes.
I agree. If we eliminate this exchange, the last recv in do_recvs() could hit this mismatch as well.
> I'm thinking (dangerous habit) we can replace ft_finalize() with an OOB sync.
If you only do the finalize OOB, there will be an outstanding rx during endpoint cleanup, which is fine, but there will be a warning in efa's log.
I don't think that matters. There are usually outstanding messages in the non-tagged case.
Can't we remove ft_finalize() but keep the 0xabc exchange?
Then the client will only send one 0xabc message, so we should not hit this issue.
@shefty @shijin-aws How does this look? 57e7ebf148b7b104ea60fa18ff0a57030a5e61d4
looks good to me.
The 0xabc message doesn't have the right completion semantics, so simply dropping ft_finalize() may still be racy. @aingerson - I think that works.
If we changed the 0xabc tsend to delivery_complete, I think it would work. (But then, I thought the current code would work too.)
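For reference, a non-runnable sketch of what that delivery-complete variant might look like (an assumption, not the merged fix): send the 0xabc message via fi_tsendmsg() with the FI_DELIVERY_COMPLETE flag, so the send completion is not generated until the peer has received the data. Variable names (tx_buf, mr_desc, remote_fi_addr, tx_ctx) are assumed from the fabtests shared code.

```c
/* Sketch only: 0xabc send with delivery-complete semantics. */
struct iovec iov = {
	.iov_base = tx_buf,
	.iov_len  = opts.transfer_size,
};
struct fi_msg_tagged msg = {
	.msg_iov   = &iov,
	.desc      = &mr_desc,
	.iov_count = 1,
	.addr      = remote_fi_addr,
	.tag       = 0xabc,
	.ignore    = 0,
	.context   = &tx_ctx,
};

ret = fi_tsendmsg(ep, &msg, FI_DELIVERY_COMPLETE);
```

With delivery-complete, the client cannot proceed to the finalize send until the server has actually received the 0xabc message, which would close the race described above.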
The PR is merged; closing for now.
We have observed that the recent change in rdm_tagged_peek (switching from fi_cq_sread to fi_cq_read by default) makes the sockets provider fail this test randomly. I don't think it's a test issue, as tcp, shm, and efa all pass this test consistently. I know the sockets provider is not maintained very actively, but we have customers in AWS that use the sockets provider for some basic functionality, so we still keep it tested in our CI.

Is the community OK with skipping this test for the sockets provider by modifying sockets.exclude?
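If we went that route, the change would presumably be a one-line exclusion pattern (a sketch; the exact file path and pattern syntax should be checked against the fabtests test configs):

```
# hypothetical sockets.exclude entry: skip the flaky peek test
rdm_tagged_peek
```

runfabtests.sh matches each test name against the patterns in the provider's exclude file and skips the ones that match.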