@amirshehataornl
The only way this could cause a problem, as far as I can tell, is if ofi_mr_info is allocated on the stack (or on the heap with malloc) and not memset to 0; peer_id might then contain random values and cause failures. I looked through the code for where that could happen and only found the place below, where I missed memsetting info to 0. But I don't see this function used anywhere in the code:
diff --git a/prov/util/src/util_mr_cache.c b/prov/util/src/util_mr_cache.c
index 7b3a4404b..3ee926603 100644
--- a/prov/util/src/util_mr_cache.c
+++ b/prov/util/src/util_mr_cache.c
@@ -391,6 +391,8 @@ struct ofi_mr_entry *ofi_mr_cache_find(struct ofi_mr_cache *cache,
struct ofi_mr_info info;
struct ofi_mr_entry *entry;
+ memset(&info, 0, sizeof(info));
+
assert(attr->iov_count == 1);
FI_DBG(cache->domain->prov, FI_LOG_MR, "find %p (len: %zu)\n",
attr->mr_iov->iov_base, attr->mr_iov->iov_len);
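To make the failure mode concrete, here is a minimal standalone sketch (mr_info and find_overlap below are simplified stand-ins, not the actual libfabric types or comparator) of how garbage in an unzeroed stack key can defeat the overlap comparison:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>

/* simplified stand-in for struct ofi_mr_info */
struct mr_info {
	struct iovec iov;
	uint64_t peer_id;
};

/* stand-in comparator: peer_id is compared before the iov overlap
 * check, so garbage in the key's peer_id prevents a match */
static int find_overlap(const struct mr_info *key, const struct mr_info *entry)
{
	if (key->peer_id != entry->peer_id)
		return key->peer_id < entry->peer_id ? -1 : 1;
	/* the real comparator would check iov overlap here */
	return 0;
}

int main(void)
{
	char buf[64];
	struct mr_info entry = { .iov = { buf, sizeof(buf) }, .peer_id = 0 };
	struct mr_info key = entry;

	key.peer_id = 0xdeadbeefULL;	/* simulate leftover stack garbage */
	printf("garbage key: cmp = %d (lookup misses)\n", find_overlap(&key, &entry));

	memset(&key, 0, sizeof(key));	/* the fix: zero the search key */
	key.iov = entry.iov;
	printf("zeroed key:  cmp = %d (lookup matches)\n", find_overlap(&key, &entry));
	return 0;
}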
This patch reduced the errors but doesn't eliminate all of them; some peer_id values are still non-zero, and the value there is constant. I attached the log below (a.log). FYI, I added a print in the search function to output peer_id. You can see that once info->peer_id != 0, the errors start to pop up.
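For reference, a sketch of the kind of print described, assuming peer_id is a uint64_t as in struct ofi_mr_info (the exact line and its placement in the local build may differ):

/* dropped into the search/compare path for debugging; needs
 * <stdio.h> and <inttypes.h> for PRIu64 */
printf("DEBUG overlap info->peer_id = %" PRIu64 ", entry->info.peer_id = %" PRIu64 "\n",
       info->peer_id, entry->info.peer_id);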
Where did you add this log:
[1,4]<stdout>:DEBUG overlap info->peer_id = 18446744073709551560, entry->info.peer_id = 0
[1,4]<stdout>:DEBUG overlap info->peer_id = 18446744073709551560, entry->info.peer_id = 0
Never mind, found it.
Can you try this patch?
diff --git a/prov/util/src/util_mr_cache.c b/prov/util/src/util_mr_cache.c
index 7b3a4404b..7e8be3f52 100644
--- a/prov/util/src/util_mr_cache.c
+++ b/prov/util/src/util_mr_cache.c
@@ -168,8 +168,12 @@ static struct ofi_mr_entry *ofi_mr_rbt_overlap(struct ofi_rbmap *tree,
const struct iovec *key)
{
struct ofi_rbnode *node;
+ struct ofi_mr_info info;
+
+ memset(&info, 0, sizeof(info));
+ info.iov = *key;

- node = ofi_rbmap_search(tree, (void *) key,
+ node = ofi_rbmap_search(tree, (void *) &info,
util_mr_find_overlap);
if (!node)
return NULL;
@@ -391,6 +395,8 @@ struct ofi_mr_entry *ofi_mr_cache_find(struct ofi_mr_cache *cache,
struct ofi_mr_info info;
struct ofi_mr_entry *entry;
+ memset(&info, 0, sizeof(info));
+
assert(attr->iov_count == 1);
FI_DBG(cache->domain->prov, FI_LOG_MR, "find %p (len: %zu)\n",
attr->mr_iov->iov_base, attr->mr_iov->iov_len);
Looks like this one is working.
Is performance comparable to runs prior to https://github.com/ofiwg/libfabric/commit/0c3a318965b43222665b4199244c13c5e04476b6? If everything looks good I'll open a PR.
On 4-node tests I don't see significant performance differences. Given that the search if-then you added is effectively doing nothing at the moment (i.e. both peer_ids are 0), we shouldn't expect any big change, right?
I was more worried about the memset and assignment in ofi_mr_rbt_overlap(). I don't think it should be a huge deal. I'll put this patch in and see what others think.
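As an aside, a zero initializer is an equivalent way to write this, and for a small struct compilers typically emit the same code either way (fragment only, assuming the real struct ofi_mr_info definition and <string.h> for memset):

/* as in the patch */
struct ofi_mr_info info;
memset(&info, 0, sizeof(info));

/* equivalent zero-initialization of every member; note memset also
 * clears padding bytes, which only matters if the struct is ever
 * compared bytewise */
struct ofi_mr_info info2 = {0};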
created this PR: https://github.com/ofiwg/libfabric/pull/8854
Confirmed the PR is working.
PR is merged. Resolving.
Describe the bug
Commit 0c3a318965b43222665b4199244c13c5e04476b6 introduced a bug that causes multiple Intel MPI Benchmark tests to fail, with slow performance and invalid-buffer errors in Open MPI (e.g. 4.1.5).
To Reproduce
Steps to reproduce the behavior: compile libfabric at or newer than 0c3a318965b43222665b4199244c13c5e04476b6, Open MPI, and IMB, then run the Allgather test.
Expected behavior
No error.
Output
Environment:
This regression was verified on c5n.18xlarge instances on AWS, but we saw the same error with multiple instance types during CI testing.
Additional context
It could be related to the MR cache search.