ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

Regression bug when testing Open MPI with Intel MPI Benchmark (IMB-MPI1) #8850

Closed jtamzn closed 1 year ago

jtamzn commented 1 year ago

Describe the bug Commit 0c3a318965b43222665b4199244c13c5e04476b6 introduced a regression that causes multiple Intel MPI Benchmark tests to fail, with slow performance and invalid-buffer errors in Open MPI (e.g. 4.1.5).

To Reproduce Steps to reproduce the behavior: Compile libfabric at or after commit 0c3a318965b43222665b4199244c13c5e04476b6, Open MPI, and IMB. Run the Allgather test.

Expected behavior No error

Output

#-----------------------------------------------------------------------------
# Benchmarking Allgather 
# #processes = 1152 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]      defects
            0          200         0.04         0.08         0.05         0.00
            1          200       180.20       405.63       288.12         0.00
            2          200       182.70       405.70       287.10         0.00
            4          200       203.16       422.27       307.89         0.00
            8          200       207.56       404.58       304.08         0.00
           16          200       241.63       403.29       325.99         0.00
           32          200       306.28       473.48       397.08         0.00
           64          200       422.44       581.09       511.91         0.00
          128          200       651.80       832.19       749.69         0.00
          256          200      1136.97      1371.71      1254.20         0.00
          512          200      1855.58      2587.34      2324.83         0.00
         1024          200      3907.33      5193.83      4577.45         0.00
         2048          200      7225.08     11462.25      9764.20         0.00
         4096          200     13552.02     15685.38     14228.83         0.00
127: Error Allgather,size = 9437184,sample #0
90: Error Allgather,size = 9437184,sample #0
Process 90: Got invalid buffer: 
Buffer entry: 0.100000 
pos: 7077888
Process 90: Expected    buffer: 
Buffer entry: 0.000000 
89: Error Allgather,size = 9437184

Environment: This regression was verified on c5n.18xlarge instances on AWS, but we saw the same error with multiple instance types during CI testing.

Additional context It could be related to the MR cache search.

wzamazon commented 1 year ago

@amirshehataornl

amirshehataornl commented 1 year ago

As far as I can tell, the only way this could cause a problem is if ofi_mr_info is allocated on the stack (or on the heap with malloc) and not memset to 0; peer_id might then contain random values and cause failures. I looked through the code for where that could happen and only found the place below, where I missed memsetting info to 0. But I don't see this function used anywhere in the code.

index 7b3a4404b..3ee926603 100644
--- a/prov/util/src/util_mr_cache.c
+++ b/prov/util/src/util_mr_cache.c
@@ -391,6 +391,8 @@ struct ofi_mr_entry *ofi_mr_cache_find(struct ofi_mr_cache *cache,
        struct ofi_mr_info info;
        struct ofi_mr_entry *entry;

+       memset(&info, 0, sizeof(info));
+
        assert(attr->iov_count == 1);
        FI_DBG(cache->domain->prov, FI_LOG_MR, "find %p (len: %zu)\n",
               attr->mr_iov->iov_base, attr->mr_iov->iov_len);
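
For context, here is a minimal standalone sketch (hypothetical names, not libfabric code) of the failure mode being described: a lookup key allocated on the stack without zeroing leaves garbage in peer_id, so a comparator that also matches on peer_id can miss an entry that was inserted with peer_id = 0.

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical stand-ins for ofi_mr_info and the cache match logic. */
struct mr_key {
	struct iovec iov;
	uint64_t peer_id;
};

static int keys_match(const struct mr_key *a, const struct mr_key *b)
{
	return a->iov.iov_base == b->iov.iov_base &&
	       a->iov.iov_len == b->iov.iov_len &&
	       a->peer_id == b->peer_id;      /* the newly compared field */
}

int main(void)
{
	char buf[64];

	struct mr_key cached;                  /* entry as it sits in the cache */
	memset(&cached, 0, sizeof(cached));    /* inserted with peer_id = 0 */
	cached.iov.iov_base = buf;
	cached.iov.iov_len = sizeof(buf);

	struct mr_key lookup;                  /* stack allocation, deliberately not zeroed */
	lookup.iov.iov_base = buf;
	lookup.iov.iov_len = sizeof(buf);
	/* lookup.peer_id intentionally left uninitialized to mimic the bug */

	printf("match: %d (0 whenever stack garbage lands in lookup.peer_id)\n",
	       keys_match(&cached, &lookup));
	return 0;
}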
jtamzn commented 1 year ago

This patch reduced the errors but doesn't eliminate all of them. Some peer_id values are still non-zero, and the value is constant. I attached the log below (a.log); FYI, I added a print in the search function to print peer_id. You can see that once info->peer_id != 0, errors start to pop up.

amirshehataornl commented 1 year ago

Where did you add this log:

[1,4]<stdout>:DEBUG overlap info->peer_id = 18446744073709551560, entry->info.peer_id = 0
[1,4]<stdout>:DEBUG overlap info->peer_id = 18446744073709551560, entry->info.peer_id = 0
amirshehataornl commented 1 year ago

Never mind, found it.

amirshehataornl commented 1 year ago

Can you try this patch?

diff --git a/prov/util/src/util_mr_cache.c b/prov/util/src/util_mr_cache.c
index 7b3a4404b..7e8be3f52 100644
--- a/prov/util/src/util_mr_cache.c
+++ b/prov/util/src/util_mr_cache.c
@@ -168,8 +168,12 @@ static struct ofi_mr_entry *ofi_mr_rbt_overlap(struct ofi_rbmap *tree,
                                               const struct iovec *key)
 {
        struct ofi_rbnode *node;
+       struct ofi_mr_info info;
+
+       memset(&info, 0, sizeof(info));
+       info.iov = *key;

-       node = ofi_rbmap_search(tree, (void *) key,
+       node = ofi_rbmap_search(tree, (void *) &info,
                                util_mr_find_overlap);
        if (!node)
                return NULL;
@@ -391,6 +395,8 @@ struct ofi_mr_entry *ofi_mr_cache_find(struct ofi_mr_cache *cache,
        struct ofi_mr_info info;
        struct ofi_mr_entry *entry;

+       memset(&info, 0, sizeof(info));
+
        assert(attr->iov_count == 1);
        FI_DBG(cache->domain->prov, FI_LOG_MR, "find %p (len: %zu)\n",
               attr->mr_iov->iov_base, attr->mr_iov->iov_len);
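
To spell out the reasoning behind the first hunk: once the overlap comparator treats its key as a struct ofi_mr_info (iov plus peer_id), passing a bare struct iovec means the comparator reads whatever bytes happen to follow the iovec as peer_id. A minimal standalone sketch of that mismatch, with hypothetical names rather than the real libfabric comparator:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical key layout mirroring iov + peer_id. */
struct info_key {
	struct iovec iov;
	uint64_t peer_id;
};

/* Comparator written for info_key; only the peer_id check is shown. */
static int peer_id_cmp(const void *key, const struct info_key *entry)
{
	const struct info_key *info = key;   /* wrong if key is really just a struct iovec */

	if (info->peer_id != entry->peer_id) /* reads past the end of a bare iovec */
		return info->peer_id < entry->peer_id ? -1 : 1;
	return 0;
}

int main(void)
{
	struct info_key entry;
	memset(&entry, 0, sizeof(entry));    /* cached entry with peer_id = 0 */

	/* Pre-patch style: hand the comparator a raw iovec. The peer_id it
	 * reads is whatever sits next to 'bare' on the stack. */
	struct iovec bare = { .iov_base = NULL, .iov_len = 64 };
	printf("bare iovec key: %d (unreliable)\n", peer_id_cmp(&bare, &entry));

	/* Post-patch style: wrap the iovec in a zeroed info struct. */
	struct info_key wrapped;
	memset(&wrapped, 0, sizeof(wrapped));
	wrapped.iov = bare;
	printf("wrapped key:    %d (peer_id 0 == 0, as intended)\n",
	       peer_id_cmp(&wrapped, &entry));
	return 0;
}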
jtamzn commented 1 year ago

Looks like this one is working.

amirshehataornl commented 1 year ago

Is performance comparable to runs prior to https://github.com/ofiwg/libfabric/commit/0c3a318965b43222665b4199244c13c5e04476b6? If everything looks good, I'll push a PR.

jtamzn commented 1 year ago

On 4-node tests I don't see significant performance differences. Given that the search check you added is effectively doing nothing at the moment (i.e. both peer_ids are 0), we shouldn't expect any big change, right?

amirshehataornl commented 1 year ago

I was more worried about the memset and assignment in ofi_mr_rbt_overlap(). I don't think it should be a huge deal. I'll put this patch in and see what others think.
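
As a side note on the overhead question: the memset touches only sizeof(struct ofi_mr_info) bytes per lookup. An equivalent way to express the same zeroing is a zero initializer at the declaration, which compilers typically lower to the same stores. A small standalone sketch (the struct here is a hypothetical stand-in mirroring the iov + peer_id layout, not ofi_mr_info itself):

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical struct mirroring the iov + peer_id layout. */
struct key {
	struct iovec iov;
	uint64_t peer_id;
};

int main(void)
{
	struct key a = {0};          /* zero-initialized at the declaration */
	struct key b;
	memset(&b, 0, sizeof(b));    /* explicit memset, as in the patch */

	/* Both forms leave every member zeroed. */
	printf("a.peer_id = %llu, b.peer_id = %llu\n",
	       (unsigned long long) a.peer_id, (unsigned long long) b.peer_id);
	return 0;
}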

amirshehataornl commented 1 year ago

created this PR: https://github.com/ofiwg/libfabric/pull/8854

jtamzn commented 1 year ago

Confirmed the PR is working.

shijin-aws commented 1 year ago

PR is merged. Resolving.