ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/
Other
555 stars 376 forks source link

prov/efa: always use p2p for system memory #10392

Closed jiaxiyan closed 2 weeks ago

jiaxiyan commented 2 weeks ago

For system memory, it should use p2p even with unregistered buffer. Otherwise it will affect performance.

shijin-aws commented 2 weeks ago

I am concerned that https://github.com/ofiwg/libfabric/commit/deaa841d887a2a5231b20228c01a69450706c19c may also be impacted by this issue. You checked

efa_rdm_ep_use_p2p(ep, rxe->desc[0]);

to determine if we can do p2p for the receive buffer. If receive buffer is unregistered and on host, we should allow rdma read as well.

I think instead of adding such iface check everywhere, we should investigate whether we can just make efa_rdm_ep_use_p2p return true for unregistered buffer. You need to audit whether that change can cause any behavior change. My most concern is here https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/rdm/efa_rdm_pke_utils.c#L69. But it looks it is called when we have a valid iov_mr, so it shouldn't cause a problem if we make efa_rdm_ep_use_p2p return true for NULL mr.

shijin-aws commented 2 weeks ago

Need to backport

darrylabbate commented 2 weeks ago

I'm a little dubious of this change and can see it easily becoming a future bug. If the null txe->desc[0] being passed here was netting an undesired result, wouldn't it limit the blast radius to simply pass true for use_p2p in efa_rdm_msg_select_rtm() when txe->desc[0] is null?

I see that we've walked through every caller of efa_rdm_ep_use_p2p(), but changing the logic for the null parameter just strikes me as something that will surprise (and not delight) us later.

shijin-aws commented 2 weeks ago

I'm a little dubious of this change and can see it easily becoming a future bug. If the null txe->desc[0] being passed here was netting an undesired result, wouldn't it limit the blast radius to simply pass true for use_p2p in efa_rdm_msg_select_rtm() when txe->desc[0] is null?

I see that we've walked through every caller of efa_rdm_ep_use_p2p(), but changing the logic for the null parameter just strikes me as something that will surprise (and not delight) us later.

I don't think it can become a bug, efa_rdm_ep_use_p2p should be interpreted as whether p2p can be used for a given MR. It is a valid deduction that NULL desc should be host memory, and we should always support p2p for host memory, I don't think it is wrong to make this function return true for NULL desc. It also saves changes everywhere to check desc[0] explicitly before calling efa_rdm_ep_use_p2p()