Closed shijin-aws closed 1 year ago
I think the man page is wrong. I don't think the owner is (should be) expected to handle. I think the expectation is that every peer inserts the correct source address. We've added FI_AV_USER_ID to help enable this. This allows the creator of an AV to pass in an fi_addr to report back when writing the completion. It is implemented in shm here. Basically the intent is that efa would pass in the application/efa fi_addr in the av_insert fi_addr array to shm along with the FI_AV_USER_ID flag and shm would pass that inputed address into the util_cq write function
@aingerson Thank you for the insights! I will look into it and it looks promising.
@aingerson I tested the FI_AV_USER_ID flag for shm av. But should this patch be applied to make it work?
(venv) [ec2-user@queue-c5n18xlarge-dy-c5n18xlarge-1 libfabric]$ git diff prov/shm/
diff --git a/prov/shm/src/smr_av.c b/prov/shm/src/smr_av.c
index ce6f83ae7..c32073b5d 100644
--- a/prov/shm/src/smr_av.c
+++ b/prov/shm/src/smr_av.c
@@ -91,7 +91,7 @@ static int smr_av_insert(struct fid_av *av_fid, const void *addr, size_t count,
}
FI_INFO(&smr_prov, FI_LOG_AV, "fi_addr: %" PRIu64 "\n", util_addr);
- if (fi_addr)
+ if (fi_addr && !(flags & FI_AV_USER_ID))
fi_addr[i] = util_addr;
Otherwise the fi_addr
passed as the input will be disregarded and shm will still use the util_addr.
I opened a PR https://github.com/ofiwg/libfabric/pull/8546.
Close this issue and the discussion is in https://github.com/ofiwg/libfabric/pull/8546
According to the fi_peer man page, for
fi_ops_cq_owner::write()
, the owner is responsible for converting the address if source addressing is needed.Recently
FI_PEER
support is added toutil_cq
https://github.com/ofiwg/libfabric/commit/6ba44d92b61c3abedbebe9a834ae62bca2d5ee9a, which implementedutil_peer_cq_src_owner_ops
as the peer_cq's owner_ops forFI_SOURCE
mode. It seems this util op does not handle the address convert yet so we are researching what's the recommended way to implement that.In EFA provider, we used to maintain an internal array
shm_rdm_addr_map
that maps the shm address to efa address. When application callsfi_av_insert
, if the peer is local, efa provider callfi_av_insert(shm_av, ..., &shm_address)
, and update this map asshm_rdm_addr_map[shm_address] = efa_address
. Then in efa's progress engine, we callfi_cq_readfrom(..., &shm_addr)
to shm cq, and write the entry to efa CQ with mapped efa address.Now if using peer cq, efa exports its cq to shm, shm will use
owner_ops->write
to write cq entry to efa cq directly. efa provider will only dofi_cq_read
for shm cq to progress data manually. This fact makes me think the address convert needs to land in theowner_ops->write
. I do not see a way that the owner could tell which provider is writing the cq entry according to the input parameters inowner_ops->write
. A working but "bad" solution could be making efa create its own write op and do the convert explicitly for shm before calling theutil_peer_cq_write_src
, as long as efa is only using shm as a peer provider.Another potential solution could be using peer AV. Though it's not clear to me how it software flow could be and there is no example in the man page yet. But the man page seems to indicate that a peer AV is not favored in efa + shm case
It will be really appreciated if you @shefty @aingerson could give me some feedback. Thank you!