open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org
Other
2.15k stars 859 forks source link

Minor memory leak in init #8166

Open ekeever1 opened 4 years ago

ekeever1 commented 4 years ago

Hello,

I am attacking minor memory leaks that are reported during a run of a trivial MPI test program during MPI init/finalize.

I have checked out 4.0.5rc2, repo revision: v4.0.5-19-gc82d677 to experiment upon. 'git submodule status' prints nothing.

It's compiled on a stock CentOS-7 machine (gcc-4.8.5) with a Xeon E5-2602 v2 processor. The test program runs on a node with centos-8, a dual AMD Epyc 7402 and 10GbE and 100Gb Mellanox Bluefield-2 Infiniband NICs.

For the test problem, the network doesn't matter - there is only one rank, and the problem occurs in init.

An example of the complaints from Valgrind is as follows,

==317245== 32 bytes in 1 blocks are definitely lost in loss record 30 of 99
==317245==    at 0x4C30EDB: malloc (vg_replace_malloc.c:309)
==317245==    by 0x5BA838C: opal_malloc (malloc.c:101)
==317245==    by 0x9C8A89F: _dstore_fetch (dstore_base.c:2215)
==317245==    by 0x9C8AE43: pmix_common_dstor_fetch (dstore_base.c:2303)
==317245==    by 0x9E950DC: (gds_ds21_base.c:122)
==317245==    by 0x88DC20B: _getfn_fastpath (pmix_client_get.c:691)
==317245==    by 0x88DA00A: OPAL_MCA_PMIX3X_PMIx_Get_nb (pmix_client_get.c:339)
==317245==    by 0x88D8AC3: OPAL_MCA_PMIX3X_PMIx_Get (pmix_client_get.c:111)
==317245==    by 0x88834F6: pmix3x_get (pmix3x_client.c:656)
==317245==    by 0x7C4C294: rte_init (ess_singleton_module.c:210)
==317245==    by 0x58C8713: orte_init (orte_init.c:271)
==317245==    by 0x4F880C3: ompi_mpi_init (ompi_mpi_init.c:518)

I have followed the allocation/leak back to pmix_client_get.c:341 where it does this, | 341 cbfunc(rc, ival, cbdata); | 342 / ownership of the memory in ival is passed to the | 343 user in the cbfunc, so don't release it here */ where cbfunc is _value_cbfunc. This copies the values in ival to cbdata and then returns, and so ival becomes a lost block.

It appears, in my trivial test, that every call that goes to pmix_client_get.c:341 has _value_cbfunc for cbfunc. If it weren't for the explicit comment I'd just put a free(ival) right there, but I assume there's a reason for the warning and I don't know the deep internals enough... should the free just go in _value_cbfunc?

Cheers, -- Erik

rhc54 commented 4 years ago

The "free" has to go in the level that requested the info, not down in PMIx - as the comment indicates, ownership of the allocated memory is passed to the caller, and it is their responsibility to release it when they are done with it.

ekeever1 commented 3 years ago

The problem is that as near as I can tell, the location of the comment is in fact the point at which the pointer stored in cbdata is lost after it's copied by value into kv.

Even if it were't lost at that point, the first call that isn't in the pmi layer,

│209        /* get our local rank from PMI */ 
│210         OPAL_MODEX_RECV_VALUE(ret, OPAL_PMIX_LOCAL_RANK,
│211                               ORTE_PROC_MY_NAME, &u16ptr, OPAL_UINT16); 

does not receive a pointer back. The call ends with the data being deposited into u16ptr.

rhc54 commented 3 years ago

Don't know what to advise, I'm afraid. If you look at that macro, it releases the returned object as it should:

#define OPAL_MODEX_RECV_VALUE(r, s, p, d, t)                                    \
    do {                                                                        \
        opal_value_t *_kv;                                                      \
        OPAL_OUTPUT_VERBOSE((1, opal_pmix_verbose_output,                       \
                            "%s[%s:%d] MODEX RECV VALUE FOR PROC %s KEY %s",    \
                            OPAL_NAME_PRINT(OPAL_PROC_MY_NAME),                 \
                            __FILE__, __LINE__,                                 \
                            OPAL_NAME_PRINT(*(p)), (s)));                       \
        if (OPAL_SUCCESS == ((r) = opal_pmix.get((p), (s), NULL, &(_kv)))) {    \
            if (NULL == _kv) {                                                  \
                (r) = OPAL_ERR_NOT_FOUND;                                       \
            } else {                                                            \
                (r) = opal_value_unload(_kv, (void**)(d), (t));                 \
                OBJ_RELEASE(_kv);                                               \
           }                                                                    \
        }                                                                       \
    } while(0);

Reading thru the code, it all looks correct. The opal_value_unload function simply uses memcpy to move the value in _kv into the memory location pointed to by u16ptr, so no malloc there.

Could be that there is an extra malloc somewhere in the dstore lookup chain in PMIx. I don't have time right now to look for it (and the line numbers don't match in my copy), but you could try following that path. I would disregard the reference to opal_malloc as OMPI's memory interceptor is in operation there - i.e., the actual malloc could be called from anywhere, including down in PMIx.

You also might want to try this again with the latest v4.0.x git head as we just committed a major update to the PMIx code. It's possible that this might have been resolved - worth taking a look.

ekeever1 commented 3 years ago

It's pmix_client_get.c:340...

    if (PMIX_SUCCESS == (rc = _getfn_fastpath(&p, key, iptr, nfo, &ival))) {
        if (NULL != cbfunc) {
            cbfunc(rc, ival, cbdata);
            /* ownership of the memory in ival is passed to the
             * user in the cbfunc, so don't release it here */
        }

I think it's that the comment is just wrong in certain cases. If _value_cbfunc is used, the data is copied from ival to cbdata (and cbdata is then properly freed later), but ival is lost - that's the alloc that's getting lost.

My original plan was to do if(cbfunc == _value_cbfunc) { free(ival); } which would clean up about 10 losses associated with this.

I will check my fixes against latest - I currently have 13 leaks fixed.

ggouaillardet commented 3 years ago

@ekeever1 In that case, why don't you simply free(ival) inside _value_cbfunc ? it seems this callback is only used in a single place (PMIx_Get_nb() fwiw)

jsquyres commented 3 years ago

I will check my fixes against latest - I currently have 13 leaks fixed.

Woo hoo!

rhc54 commented 3 years ago

I'm not sure this is the optimal fix - I'm trying something out and will post it here in a bit.

rhc54 commented 3 years ago

I think this should work and eliminate at least one unnecessary malloc operation:

diff --git a/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client_get.c b/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client_get.c
index 90b95d97ce..ade47bbfcd 100644
--- a/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client_get.c
+++ b/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client_get.c
@@ -444,11 +444,7 @@ static void _value_cbfunc(pmix_status_t status, pmix_value_t *kv, void *cbdata)
     PMIX_ACQUIRE_OBJECT(cb);
     cb->status = status;
     if (PMIX_SUCCESS == status) {
-        PMIX_BFROPS_COPY(rc, pmix_client_globals.myserver,
-                         (void**)&cb->value, kv, PMIX_VALUE);
-        if (PMIX_SUCCESS != rc) {
-            PMIX_ERROR_LOG(rc);
-        }
+        cb->value = kv;
     }
     PMIX_POST_OBJECT(cb);
     PMIX_WAKEUP_THREAD(&cb->lock);
@@ -846,9 +842,6 @@ static void _getnbfn(int fd, short flags, void *cbdata)
         }
         cb->cbfunc.valuefn(rc, val, cb->cbdata);
     }
-    if (NULL != val) {
-        PMIX_VALUE_RELEASE(val);
-    }
     PMIX_RELEASE(cb);
     return;
ekeever1 commented 3 years ago

I grabbed the latest from Git and can no longer configure/compile, looks like pmix_rename.h.in didn't get added?

============================================================================
== Configuration complete
============================================================================
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating include/pmix_version.h
config.status: error: cannot find input file: `include/pmix_rename.h.in'
configure: /bin/sh './configure' *failed* for opal/mca/pmix/pmix3x/pmix
checking if v3.x component is to be used... yes - using the internal v3.x library
configure: WARNING: INTERNAL PMIX FAILED TO CONFIGURE
configure: error: CANNOT CONTINUE

but since none of the changes collided with mine here's my current set of leak patches for review,

mempatch.txt

rhc54 commented 3 years ago

It looks like your git clone is confused - can you try a fresh clone? I did and everything built fine.

Not sure about some of those leak patches - some of them might be due to the leak I patched above. You might try adding that patch to a fresh git clone of the v4.0.x branch and then recheck your patch.