carns opened 1 month ago
I can reproduce the problem on my machine; I'll look into it (it's weird, though: the problem doesn't appear in the GitHub workflow even though ASAN is enabled there).
Ok, so the problem is caused by the fact that the two ES that the test adds use the primary pool, so one of them picks up the `main` ULT, runs `margo_cleanup`, and ends up destroying itself in `__margo_abt_xstream_destroy`. I have pushed a fix to the test that makes these ES use `__pool_1__` instead of `__primary__`, but this is not a fix for the underlying problem, which is that we should ensure `margo_cleanup` is run from the primary ES (not just in the primary pool).
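A minimal sketch of what such a guard could look like (this is my illustration, not the actual fix in the PR; it assumes Argobots' convention that the primary ES is assigned rank 0, and the error handling is elided):

```c
/* Hypothetical helper: report whether the calling ULT is currently
 * executing on the primary ES. Assumes the Argobots convention that
 * the primary ES has rank 0. */
#include <abt.h>
#include <stdbool.h>

static bool running_on_primary_es(void)
{
    ABT_xstream self;
    int rank = -1;
    if (ABT_xstream_self(&self) != ABT_SUCCESS) return false;
    if (ABT_xstream_get_rank(self, &rank) != ABT_SUCCESS) return false;
    return rank == 0;
}
```

When this returns false, `margo_cleanup` could, for instance, repost the teardown work as a ULT in the primary pool and block on an eventual until it completes, rather than destroying ES from a ULT that may be running on one of them.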
I have started a PR to try to solve the issue: https://github.com/mochi-hpc/mochi-margo/pull/281
For now I simply added a unit test showcasing the problem. The problem appears on my laptop with ASAN. It doesn't appear in the GitHub workflow, probably because containers in those workflows only get one core allocated, so `margo_cleanup` is always run by the primary ES.
I initially imagined a solution in which `__margo_abt_destroy` (the function that destroys everything Argobots-related and finalizes Argobots) creates a new ES for "garbage collection". This ES would be associated with its own pool and execute a single ULT that destroys all the other ES (apart from the primary) and pools. `__margo_abt_destroy` would then join/free this ES before finalizing Argobots. If finalization is done correctly (i.e. if it happens in the primary pool), then after the garbage-collecting ES terminates, the primary ES should still remain to complete the execution of `__margo_abt_destroy`. Yet for some reason this doesn't work.
Here is my code, for reference. I tried both with a join and with waiting on an eventual; either way, it looks like the calling ULT is getting destroyed, leaving memory leaks.
```c
struct gc_abt_destroy_args {
    margo_abt_t* abt;
    ABT_eventual ev;
};

static void gc_abt_destroy(void* args)
{
    struct gc_abt_destroy_args* gc_args = (struct gc_abt_destroy_args*)args;
    margo_abt_t* a = gc_args->abt;
    for (unsigned i = 0; i < a->xstreams_len; ++i) {
        __margo_abt_xstream_destroy(a->xstreams + i, a);
    }
    free(a->xstreams);
    for (unsigned i = 0; i < a->pools_len; ++i) {
        __margo_abt_pool_destroy(a->pools + i, a);
    }
    free(a->pools);
    free(a->profiling_dir);
    memset(a, 0, sizeof(*a));
    ABT_eventual_set(gc_args->ev, NULL, 0);
}

void __margo_abt_destroy(margo_abt_t* a)
{
    ABT_pool gc_pool = ABT_POOL_NULL;
    ABT_pool_create_basic(ABT_POOL_FIFO, ABT_POOL_ACCESS_MPMC, ABT_TRUE, &gc_pool);
    ABT_xstream gc_es = ABT_XSTREAM_NULL;
    ABT_xstream_create_basic(ABT_SCHED_BASIC, 1, &gc_pool, ABT_SCHED_CONFIG_NULL, &gc_es);
    struct gc_abt_destroy_args gc_args = {
        .abt = a,
        .ev  = ABT_EVENTUAL_NULL
    };
    ABT_eventual_create(0, &gc_args.ev);
    ABT_thread_create_to(gc_pool, gc_abt_destroy, &gc_args, ABT_THREAD_ATTR_NULL, NULL);
    ABT_eventual_wait(gc_args.ev, 0);
    ABT_eventual_free(&gc_args.ev);
    ABT_xstream_join(gc_es);
    ABT_xstream_free(&gc_es);
    if (--g_margo_num_instances == 0 && g_margo_abt_init) {
        /* shut down global abt profiling if needed */
        if (g_margo_abt_prof_init) {
            if (g_margo_abt_prof_started) {
                ABTX_prof_stop(g_margo_abt_prof_context);
                g_margo_abt_prof_started = 0;
            }
            ABTX_prof_finalize(g_margo_abt_prof_context);
            g_margo_abt_prof_init = 0;
        }
        ABT_finalize();
        g_margo_abt_init = false;
    }
}
```
I don't get it. I tried writing a pure Argobots reproducer (of the original setup, where you could have a non-primary ES "self-destruct"), and it works fine: `main` creates 4 extra ES that all share the primary pool, I then yield until `main` runs on one of these ES, and I `ABT_xstream_join` and `ABT_xstream_free` each extra ES one by one. I can see the `main` ULT changing ES as the one it's on gets freed, no problem. But the same thing in Margo gets us an ASAN error.
Note also that when ASAN is disabled, everything works fine. Even if `margo_cleanup` runs on a non-primary ES and tries to destroy it, it has no problem moving to another ES, and it ultimately ends up running on the primary ES. So my assumption is that this is an ASAN issue within Argobots, not a Margo issue.
I noticed this while preparing to test #278, but it is unrelated. Just noting it for now so I don't forget to come back and look at it.
With gcc 13.2 and address sanitizer on Ubuntu 24.04, the `tests/unit-tests/margo-elasticity` unit test is failing with the following log: