Open ekeever1 opened 4 years ago
The "free" has to go in the level that requested the info, not down in PMIx - as the comment indicates, ownership of the allocated memory is passed to the caller, and it is their responsibility to release it when they are done with it.
The problem is that as near as I can tell, the location of the comment is in fact the point at which the pointer stored in cbdata is lost after it's copied by value into kv.
Even if it were't lost at that point, the first call that isn't in the pmi layer,
│209 /* get our local rank from PMI */
│210 OPAL_MODEX_RECV_VALUE(ret, OPAL_PMIX_LOCAL_RANK,
│211 ORTE_PROC_MY_NAME, &u16ptr, OPAL_UINT16);
does not receive a pointer back. The call ends with the data being deposited into u16ptr.
Don't know what to advise, I'm afraid. If you look at that macro, it releases the returned object as it should:
#define OPAL_MODEX_RECV_VALUE(r, s, p, d, t) \
do { \
opal_value_t *_kv; \
OPAL_OUTPUT_VERBOSE((1, opal_pmix_verbose_output, \
"%s[%s:%d] MODEX RECV VALUE FOR PROC %s KEY %s", \
OPAL_NAME_PRINT(OPAL_PROC_MY_NAME), \
__FILE__, __LINE__, \
OPAL_NAME_PRINT(*(p)), (s))); \
if (OPAL_SUCCESS == ((r) = opal_pmix.get((p), (s), NULL, &(_kv)))) { \
if (NULL == _kv) { \
(r) = OPAL_ERR_NOT_FOUND; \
} else { \
(r) = opal_value_unload(_kv, (void**)(d), (t)); \
OBJ_RELEASE(_kv); \
} \
} \
} while(0);
Reading thru the code, it all looks correct. The opal_value_unload
function simply uses memcpy
to move the value in _kv
into the memory location pointed to by u16ptr
, so no malloc
there.
Could be that there is an extra malloc
somewhere in the dstore lookup chain in PMIx. I don't have time right now to look for it (and the line numbers don't match in my copy), but you could try following that path. I would disregard the reference to opal_malloc
as OMPI's memory interceptor is in operation there - i.e., the actual malloc
could be called from anywhere, including down in PMIx.
You also might want to try this again with the latest v4.0.x git head as we just committed a major update to the PMIx code. It's possible that this might have been resolved - worth taking a look.
It's pmix_client_get.c:340...
if (PMIX_SUCCESS == (rc = _getfn_fastpath(&p, key, iptr, nfo, &ival))) {
if (NULL != cbfunc) {
cbfunc(rc, ival, cbdata);
/* ownership of the memory in ival is passed to the
* user in the cbfunc, so don't release it here */
}
I think it's that the comment is just wrong in certain cases. If _value_cbfunc is used, the data is copied from ival to cbdata (and cbdata is then properly freed later), but ival is lost - that's the alloc that's getting lost.
My original plan was to do
if(cbfunc == _value_cbfunc) { free(ival); }
which would clean up about 10 losses associated with this.
I will check my fixes against latest - I currently have 13 leaks fixed.
@ekeever1 In that case, why don't you simply free(ival)
inside _value_cbfunc
?
it seems this callback is only used in a single place (PMIx_Get_nb()
fwiw)
I will check my fixes against latest - I currently have 13 leaks fixed.
Woo hoo!
I'm not sure this is the optimal fix - I'm trying something out and will post it here in a bit.
I think this should work and eliminate at least one unnecessary malloc operation:
diff --git a/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client_get.c b/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client_get.c
index 90b95d97ce..ade47bbfcd 100644
--- a/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client_get.c
+++ b/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client_get.c
@@ -444,11 +444,7 @@ static void _value_cbfunc(pmix_status_t status, pmix_value_t *kv, void *cbdata)
PMIX_ACQUIRE_OBJECT(cb);
cb->status = status;
if (PMIX_SUCCESS == status) {
- PMIX_BFROPS_COPY(rc, pmix_client_globals.myserver,
- (void**)&cb->value, kv, PMIX_VALUE);
- if (PMIX_SUCCESS != rc) {
- PMIX_ERROR_LOG(rc);
- }
+ cb->value = kv;
}
PMIX_POST_OBJECT(cb);
PMIX_WAKEUP_THREAD(&cb->lock);
@@ -846,9 +842,6 @@ static void _getnbfn(int fd, short flags, void *cbdata)
}
cb->cbfunc.valuefn(rc, val, cb->cbdata);
}
- if (NULL != val) {
- PMIX_VALUE_RELEASE(val);
- }
PMIX_RELEASE(cb);
return;
I grabbed the latest from Git and can no longer configure/compile, looks like pmix_rename.h.in didn't get added?
============================================================================
== Configuration complete
============================================================================
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating include/pmix_version.h
config.status: error: cannot find input file: `include/pmix_rename.h.in'
configure: /bin/sh './configure' *failed* for opal/mca/pmix/pmix3x/pmix
checking if v3.x component is to be used... yes - using the internal v3.x library
configure: WARNING: INTERNAL PMIX FAILED TO CONFIGURE
configure: error: CANNOT CONTINUE
but since none of the changes collided with mine here's my current set of leak patches for review,
It looks like your git clone is confused - can you try a fresh clone? I did and everything built fine.
Not sure about some of those leak patches - some of them might be due to the leak I patched above. You might try adding that patch to a fresh git clone of the v4.0.x branch and then recheck your patch.
Hello,
I am attacking minor memory leaks that are reported during a run of a trivial MPI test program during MPI init/finalize.
I have checked out 4.0.5rc2, repo revision: v4.0.5-19-gc82d677 to experiment upon. 'git submodule status' prints nothing.
It's compiled on a stock CentOS-7 machine (gcc-4.8.5) with a Xeon E5-2602 v2 processor. The test program runs on a node with centos-8, a dual AMD Epyc 7402 and 10GbE and 100Gb Mellanox Bluefield-2 Infiniband NICs.
For the test problem, the network doesn't matter - there is only one rank, and the problem occurs in init.
An example of the complaints from Valgrind is as follows,
I have followed the allocation/leak back to pmix_client_get.c:341 where it does this, | 341 cbfunc(rc, ival, cbdata); | 342 / ownership of the memory in ival is passed to the | 343 user in the cbfunc, so don't release it here */ where cbfunc is _value_cbfunc. This copies the values in ival to cbdata and then returns, and so ival becomes a lost block.
It appears, in my trivial test, that every call that goes to pmix_client_get.c:341 has _value_cbfunc for cbfunc. If it weren't for the explicit comment I'd just put a free(ival) right there, but I assume there's a reason for the warning and I don't know the deep internals enough... should the free just go in _value_cbfunc?
Cheers, -- Erik