ofiwg / libfabric

rdm_tagged_bw is broken with OOB sync #10118

Open shijin-aws opened 1 week ago

shijin-aws commented 1 week ago

This happens on both main and v1.21.x (I haven't checked older versions yet).

FI_LOG_LEVEL=warn fi_rdm_tagged_bw -p efa -b -j 0
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
64      20k     1.2m        0.06s     22.55       2.84       0.35
256     20k     4.8m        0.03s    201.48       1.27       0.79
1k      20k     19m         0.03s    813.89       1.26       0.79
libfabric:3180488:1719276361::efa:cq:efa_rdm_rxe_report_completion():762<warn> Message truncated! tag: 60030 incoming message size: 4096 receiving buffer size: 1024
[error] fabtests:common/shared.c:2904: cq_readerr 265 (Truncation error), provider errno: -265 (Unknown error)

The same test fails similarly with the tcp provider:

ubuntu@ip-172-31-39-234:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric/fabtests$ FI_LOG_LEVEL=warn fi_rdm_tagged_bw -p tcp -b -j 0
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
64      20k     1.2m        0.04s     31.00       2.06       0.48
256     20k     4.8m        0.05s     94.87       2.70       0.37
1k      20k     19m         0.07s    281.92       3.63       0.28
libfabric:3181046:1719276643::tcp:ep_data:xnet_handle_truncate():159<warn> msg recv truncated
[error] fabtests:common/shared.c:2904: cq_readerr 265 (Truncation error), provider errno: 11 (Resource temporarily unavailable)

Since two unrelated providers (efa and tcp) fail the same way, this should be a fabtests issue rather than a provider bug.
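
The efa warning above carries the key numbers: a 4096-byte message arrived while the posted receive buffer was 1024 bytes, which matches the 1k size of the last row that completed. For comparing runs, a minimal Python sketch that pulls those fields out of such a log line could look like the following. The helper name `parse_truncation` and the regex are illustrative assumptions based only on the log format quoted in this issue; they are not part of fabtests or libfabric.

```python
import re

# Matches the efa truncation warning exactly as quoted in this issue.
# The tag is kept as a string because the log does not say which base it is printed in.
TRUNC_RE = re.compile(
    r"tag: (?P<tag>[0-9a-fA-F]+) "
    r"incoming message size: (?P<incoming>\d+) "
    r"receiving buffer size: (?P<buffer>\d+)"
)


def parse_truncation(line):
    """Return tag and sizes from a truncation warning, or None if absent."""
    m = TRUNC_RE.search(line)
    if m is None:
        return None
    return {
        "tag": m.group("tag"),
        "incoming": int(m.group("incoming")),
        "buffer": int(m.group("buffer")),
    }


# The warning line from the efa run above.
LOG_LINE = (
    "libfabric:3180488:1719276361::efa:cq:efa_rdm_rxe_report_completion():762"
    "<warn> Message truncated! tag: 60030 incoming message size: 4096 "
    "receiving buffer size: 1024"
)

info = parse_truncation(LOG_LINE)
print(info)
# A message larger than the posted buffer is what triggers the truncation error.
print("incoming exceeds buffer:", info["incoming"] > info["buffer"])
```

The tcp warning quoted above does not print sizes, so this sketch only applies to the efa log format.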

shijin-aws commented 1 week ago

Non-OOB (-E) sync works.

rdm_tagged_pingpong and rma_bw also work fine with OOB sync.

j-xiong commented 1 week ago

Is it related to https://github.com/ofiwg/libfabric/pull/10108?

Update: probably not, since the commit is in main only.

shijin-aws commented 1 week ago

@j-xiong No, it seems to be a long-standing issue. I will dig into it.