ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

Peer API: need a way to convert MR desc between owner and peer providers? #8936

Closed shijin-aws closed 1 year ago

shijin-aws commented 1 year ago

While working on onboarding efa to the util SRX framework (https://github.com/ofiwg/libfabric/pull/8907), one blocker I currently have is how to convert the MR desc between the owner and peer providers. Efa has an internal struct efa_mr (https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/efa_mr.h#L53-L63), which has shm_mr as a member. When the application calls fi_mr_regattr on efa, efa in turn calls fi_mr_regattr on shm, and efa can retrieve the shm MR desc from the desc passed by the application via fi_mr_desc(((struct efa_mr *)desc)->shm_mr).
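To make the nesting concrete, here is a minimal sketch of the translation step. All types and function names below (mock_mr, mock_efa_mr, mock_mr_desc, owner_desc_to_shm_desc) are hypothetical stand-ins for illustration, not the real efa or libfabric definitions; the only thing taken from the description above is that the owner's MR holds a pointer to the shm MR and that the application's desc points at the owner's MR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a registered MR; not libfabric's struct fid_mr. */
struct mock_mr {
    void *desc;              /* provider-specific descriptor */
};

/* Hypothetical stand-in for efa_mr: the owner's MR wraps the shm MR. */
struct mock_efa_mr {
    struct mock_mr self;     /* the owner's own registration */
    struct mock_mr *shm_mr;  /* registration made with shm, may be NULL */
};

/* Stand-in for fi_mr_desc(): returns the descriptor of an MR. */
static void *mock_mr_desc(struct mock_mr *mr)
{
    assert(mr != NULL);
    return mr->desc;
}

/*
 * Translate the desc the application passed (assumed to point at the
 * owner's MR) into the desc that shm understands. The cast is applied
 * to desc first, and only then is shm_mr dereferenced.
 */
static void *owner_desc_to_shm_desc(void *desc)
{
    struct mock_efa_mr *efa_mr = desc;

    if (!efa_mr || !efa_mr->shm_mr)
        return NULL;         /* nothing registered with shm */
    return mock_mr_desc(efa_mr->shm_mr);
}
```

The NULL handling is an assumption of this sketch; a real provider may treat a missing shm registration differently.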

One challenge I have right now is that if I simply call util_srx_generic_recv in efa's fi_recv, there is no good way to do this MR desc translation before it calls start_msg. Because the application may call fi_recv with FI_ADDR_UNSPEC, efa does not know whether the incoming message will be intra-node or inter-node, so it cannot translate the MR desc before calling util_srx_generic_recv either.

Then, when util_srx_generic_recv finds a matching rx entry in the unexpected queue, it updates the rx entry with a desc that only the owner can read, while the start_msg that gets called could be the peer's, which cannot understand this desc. Before switching to util_srx_generic_recv, efa used its own generic_recv, which has an extra step that updates the desc in the rx_entry before calling start_msg.

To make this MR desc update a general procedure that works for all providers and can be used by the util SRX, I am wondering whether we could introduce a new function in the peer SRX's peer_ops, like this (using non-tagged as an example):

int update_msg(struct fid_peer_srx *peer_srx, struct fi_peer_rx_entry *rx_entry);

This function would be called by the owner before calling start_msg in generic_recv. To fix the MR desc translation issue mentioned above, efa could tell whether this rx_entry was queued by the owner or by the peer by reading rx_entry->srx and comparing it with the peer_srx argument of the call, which should be the owner's peer_srx in this case, and then update the desc in the rx_entry accordingly.

How to implement update_msg is provider specific.
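As a rough illustration of what a provider's update_msg might do, here is a minimal sketch using simplified mock types. The structs and names below (mock_peer_srx, mock_rx_entry, mock_owner_mr, mock_update_msg) are hypothetical and much smaller than the real fid_peer_srx and fi_peer_rx_entry; in particular, the single void *desc field is a simplification made for this example. The sketch only shows the comparison between rx_entry->srx and the owner's SRX described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct fid_peer_srx. */
struct mock_peer_srx {
    int id;
};

/* Hypothetical, simplified stand-in for struct fi_peer_rx_entry. */
struct mock_rx_entry {
    struct mock_peer_srx *srx;  /* SRX that queued this entry */
    void *desc;                 /* desc as posted by the application */
};

/* Hypothetical owner MR that knows the matching peer descriptor. */
struct mock_owner_mr {
    void *owner_desc;
    void *peer_desc;
};

/*
 * Sketch of update_msg: called by the owner with its own SRX before
 * start_msg. If the entry was queued by a different SRX (the peer),
 * swap the owner's desc for the one the peer understands.
 */
static int mock_update_msg(struct mock_peer_srx *owner_srx,
                           struct mock_rx_entry *rx_entry)
{
    struct mock_owner_mr *mr;

    assert(owner_srx != NULL && rx_entry != NULL);

    if (rx_entry->srx == owner_srx)
        return 0;  /* queued by the owner: desc is already usable */

    mr = rx_entry->desc;
    rx_entry->desc = mr ? mr->peer_desc : NULL;
    return 0;
}
```

With this shape, an entry queued by the owner keeps its desc untouched, while an entry queued by the peer ends up with the peer's desc before start_msg runs.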

Another solution could be to introduce a peer MR API, similar to peer AV and peer CQ, but I am not sure what it would look like right now.

I would appreciate your feedback on this, @aingerson @shefty.

shijin-aws commented 1 year ago

Had an offline discussion with @aingerson. We agree that this is not an issue in the peer API but something in the util SRX framework. The current idea for solving it is to introduce an update_mr(srx, rx_entry) function pointer in util_srx_ctx so the owner can update the rx entry passed to the peer.
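A minimal sketch of how such an optional hook could sit in the receive path follows. Everything here is a mock: mock_srx_ctx, mock_rx_entry, mock_owner_mr, mock_start_msg, mock_owner_update_mr and mock_generic_recv_matched are hypothetical names, and the queued_by_peer flag is a simplification invented for this example, so this is not the actual util_srx_ctx layout or the code in the PR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical owner MR that knows the matching peer descriptor. */
struct mock_owner_mr {
    void *peer_desc;
};

/* Hypothetical, simplified rx entry. */
struct mock_rx_entry {
    void *desc;
    int queued_by_peer;  /* simplification: 1 if the peer queued it */
};

/* Hypothetical, simplified SRX context with an optional hook. */
struct mock_srx_ctx {
    void (*update_mr)(struct mock_srx_ctx *srx,
                      struct mock_rx_entry *rx_entry);
    void *last_started_desc;  /* records what start_msg received */
};

/* Example owner hook: translate the desc only for peer-queued entries. */
static void mock_owner_update_mr(struct mock_srx_ctx *srx,
                                 struct mock_rx_entry *rx_entry)
{
    struct mock_owner_mr *mr = rx_entry->desc;

    (void) srx;
    if (rx_entry->queued_by_peer)
        rx_entry->desc = mr ? mr->peer_desc : NULL;
}

/* Stand-in for start_msg: just records the desc it was handed. */
static void mock_start_msg(struct mock_srx_ctx *srx,
                           struct mock_rx_entry *rx_entry)
{
    srx->last_started_desc = rx_entry->desc;
}

/* Matched-unexpected path: store the desc, run the hook, then start. */
static void mock_generic_recv_matched(struct mock_srx_ctx *srx,
                                      struct mock_rx_entry *rx_entry,
                                      void *app_desc)
{
    assert(srx != NULL && rx_entry != NULL);

    rx_entry->desc = app_desc;
    if (srx->update_mr)  /* hook is optional */
        srx->update_mr(srx, rx_entry);
    mock_start_msg(srx, rx_entry);
}
```

Leaving the pointer NULL keeps the existing behavior for providers that do not need any translation, which is one reason an optional callback may fit better here than a mandatory peer_ops entry.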

I will move the discussion to https://github.com/ofiwg/libfabric/pull/8907 and close this issue.