ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/
Other
555 stars 376 forks source link

peer CQ: recommended way to convert address? #8504

Closed shijin-aws closed 1 year ago

shijin-aws commented 1 year ago

According to the fi_peer man page, for fi_ops_cq_owner::write() , the owner is responsible for converting the address if source addressing is needed.

For example, the fi_addr_t is relative to the peer’s AV. The owner is responsible for converting the address if source addressing is needed

Recently FI_PEER support is added to util_cq https://github.com/ofiwg/libfabric/commit/6ba44d92b61c3abedbebe9a834ae62bca2d5ee9a, which implemented util_peer_cq_src_owner_ops as the peer_cq's owner_ops for FI_SOURCE mode. It seems this util op does not handle the address convert yet so we are researching what's the recommended way to implement that.

In EFA provider, we used to maintain an internal array shm_rdm_addr_map that maps the shm address to efa address. When application calls fi_av_insert, if the peer is local, efa provider call fi_av_insert(shm_av, ..., &shm_address) , and update this map as shm_rdm_addr_map[shm_address] = efa_address. Then in efa's progress engine, we call fi_cq_readfrom(..., &shm_addr) to shm cq, and write the entry to efa CQ with mapped efa address.

Now if using peer cq, efa exports its cq to shm, shm will use owner_ops->write to write cq entry to efa cq directly. efa provider will only do fi_cq_read for shm cq to progress data manually. This fact makes me think the address convert needs to land in the owner_ops->write. I do not see a way that the owner could tell which provider is writing the cq entry according to the input parameters in owner_ops->write. A working but "bad" solution could be making efa create its own write op and do the convert explicitly for shm before calling the util_peer_cq_write_src, as long as efa is only using shm as a peer provider.

Another potential solution could be using peer AV. Though it's not clear to me how it software flow could be and there is no example in the man page yet. But the man page seems to indicate that a peer AV is not favored in efa + shm case

.... It specifically targets the use case of having a main provider paired with an offload provider, where the offload provider leverages the communication that has already been established through the main provider. In other situations, such as that mentioned above pairing a tcp provider with a shared memory provider, each peer will likely have their own AV that is not shared.

It will be really appreciated if you @shefty @aingerson could give me some feedback. Thank you!

aingerson commented 1 year ago

I think the man page is wrong. I don't think the owner is (should be) expected to handle. I think the expectation is that every peer inserts the correct source address. We've added FI_AV_USER_ID to help enable this. This allows the creator of an AV to pass in an fi_addr to report back when writing the completion. It is implemented in shm here. Basically the intent is that efa would pass in the application/efa fi_addr in the av_insert fi_addr array to shm along with the FI_AV_USER_ID flag and shm would pass that inputed address into the util_cq write function

shijin-aws commented 1 year ago

@aingerson Thank you for the insights! I will look into it and it looks promising.

shijin-aws commented 1 year ago

@aingerson I tested the FI_AV_USER_ID flag for shm av. But should this patch be applied to make it work?

(venv) [ec2-user@queue-c5n18xlarge-dy-c5n18xlarge-1 libfabric]$ git diff prov/shm/
diff --git a/prov/shm/src/smr_av.c b/prov/shm/src/smr_av.c
index ce6f83ae7..c32073b5d 100644
--- a/prov/shm/src/smr_av.c
+++ b/prov/shm/src/smr_av.c
@@ -91,7 +91,7 @@ static int smr_av_insert(struct fid_av *av_fid, const void *addr, size_t count,
                }

                FI_INFO(&smr_prov, FI_LOG_AV, "fi_addr: %" PRIu64 "\n", util_addr);
-               if (fi_addr)
+               if (fi_addr && !(flags & FI_AV_USER_ID))
                        fi_addr[i] = util_addr;

Otherwise the fi_addr passed as the input will be disregarded and shm will still use the util_addr.

shijin-aws commented 1 year ago

I opened a PR https://github.com/ofiwg/libfabric/pull/8546.

shijin-aws commented 1 year ago

Close this issue and the discussion is in https://github.com/ofiwg/libfabric/pull/8546