mochi-hpc / mochi-margo

Argobots bindings for the Mercury RPC library

ASAN issue at finalization time when multiple ES use the primary pool #280

Open carns opened 1 month ago

carns commented 1 month ago

I noticed this while preparing to test #278, but it is unrelated. Just noting it for now so I don't forget to come back and look at it.

With gcc 13.2 and address sanitizer on Ubuntu 24.04, the tests/unit-tests/margo-elasticity unit test is failing with the following log:

Running test suite with seed 0xa4c8035a...
/margo/add_pool_from_json            [ OK    ] [ 1.61194215 / 1.01860575 CPU ]
/margo/add_pool_external             [ OK    ] [ 1.61572613 / 1.02395611 CPU ]
/margo/remove_pool                   [ OK    ] [ 1.61638014 / 1.02584012 CPU ]
/margo/add_xstream_from_json         [ ERROR ]
[error] Could not create new xstream: name ("my_es") already used
[error] Could not create new xstream: name ("my_es") already used
[error] xstream definition in configuration must be of type object
AddressSanitizer:DEADLYSIGNAL
=================================================================
==701858==ERROR: AddressSanitizer: SEGV on unknown address 0x778d74a09400 (pc 0x778d8b9047b3 bp 0x7ffce9316120 sp 0x7ffce9316010 T0)
==701858==The signal is caused by a WRITE memory access.
    #0 0x778d8b9047b3 in __margo_abt_xstream_destroy ../src/margo-abt-config.c:761
    #1 0x778d8b9071eb in __margo_abt_destroy ../src/margo-abt-config.c:1187
    #2 0x778d8b8cc8bc in margo_cleanup ../src/margo-core.c:183
    #3 0x778d8b8cdc26 in margo_finalize ../src/margo-core.c:281
    #4 0x5bce4793b3af in add_xstream_from_json ../tests/unit-tests/margo-elasticity.c:360
    #5 0x5bce4792f072 in munit_test_runner_exec ../tests/unit-tests/munit/munit.c:1195
    #6 0x5bce4792fc65 in munit_test_runner_run_test_with_params ../tests/unit-tests/munit/munit.c:1356
    #7 0x5bce479317fd in munit_test_runner_run_test ../tests/unit-tests/munit/munit.c:1584
    #8 0x5bce4793251f in munit_test_runner_run_suite ../tests/unit-tests/munit/munit.c:1677
    #9 0x5bce479327eb in munit_test_runner_run ../tests/unit-tests/munit/munit.c:1696
    #10 0x5bce479357ea in munit_suite_main_custom ../tests/unit-tests/munit/munit.c:2026
    #11 0x5bce47935caa in munit_suite_main ../tests/unit-tests/munit/munit.c:2054
    #12 0x5bce4793dad1 in main ../tests/unit-tests/margo-elasticity.c:557
    #13 0x778d8b62a1c9 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #14 0x778d8b62a28a in __libc_start_main_impl ../csu/libc-start.c:360
    #15 0x5bce4792bfa4 in _start (/home/carns/working/src/mochi/mochi-margo/build/tests/unit-tests/.libs/margo-elasticity+0x6fa4) (BuildId: 5e913c0bdda233a0efafe3483c63df9191535245)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV ../src/margo-abt-config.c:761 in __margo_abt_xstream_destroy
==701858==ABORTING
Error: child killed by signal 6 (Aborted)
/margo/add_xstream_external          [ OK    ] [ 1.71324961 / 1.61892099 CPU ]
/margo/remove_xstream                [ OK    ] [ 1.51757969 / 1.02752020 CPU ]
5 of 6 (83%) tests successful, 0 (0%) test skipped.
FAIL tests/unit-tests/margo-elasticity (exit status: 1)

mdorier commented 1 month ago

I can reproduce the problem on my machine; I'll look into it. (It's strange, though: the problem doesn't appear in the GitHub workflow even though ASAN is enabled there.)

mdorier commented 1 month ago

OK, so the problem is that the two ES added by the test use the primary pool, so one of them picks up the main ULT, runs margo_cleanup, and ends up destroying itself in __margo_abt_xstream_destroy. I have pushed a fix to the test that makes these ES use __pool_1__ instead of __primary__, but that only works around the symptom; the real fix is to ensure that margo_cleanup runs from the primary ES (not just in the primary pool).
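
As a rough illustration of that distinction (not actual Margo code), a helper along these lines could tell whether the caller is on the primary ES, relying on the fact that the primary ES has rank 0 in Argobots:

  #include <abt.h>

  /* Hypothetical helper: returns ABT_TRUE if the calling ULT is currently
   * executing on the primary ES. Being scheduled in the primary *pool* is
   * not enough, since any ES that shares the pool may have picked the ULT
   * up; the primary ES is the one with rank 0. */
  static ABT_bool running_on_primary_es(void)
  {
      ABT_xstream self_es;
      int         rank = -1;
      ABT_self_get_xstream(&self_es);
      ABT_xstream_get_rank(self_es, &rank);
      return rank == 0 ? ABT_TRUE : ABT_FALSE;
  }

margo_cleanup could, for instance, refuse to tear down the ES list (or hand the work off) when such a check returns ABT_FALSE.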

mdorier commented 1 month ago

I have started a PR to try to solve the issue: https://github.com/mochi-hpc/mochi-margo/pull/281. For now I have simply added a unit test showcasing the problem. The problem appears on my laptop with ASAN, but not in the GitHub workflow, probably because containers in those workflows get only one core, so margo_cleanup is always run by the primary ES.

I initially imagined a solution in which __margo_abt_destroy (the function that destroys everything Argobots-related and finalizes Argobots) creates a new ES for "garbage collection". This ES would be associated with its own pool and would execute a single ULT that destroys all the other ES (apart from the primary) and pools. __margo_abt_destroy would then join/free this ES before finalizing Argobots. If finalization is done correctly (i.e. it happens in the primary pool), then after the garbage-collecting ES terminates, the primary ES should still be there to complete the execution of __margo_abt_destroy. Yet for some reason this doesn't work.

Here is my code, for reference. I tried both a plain join and waiting on an eventual; either way, the calling ULT appears to get destroyed, leaving memory leaks.

  struct gc_abt_destroy_args {
      margo_abt_t* abt;
      ABT_eventual ev;
  };

  /* ULT running on the dedicated GC ES: destroys all ES and pools recorded
   * in the margo_abt_t, then signals the eventual */
  static void gc_abt_destroy(void* args)
  {
      struct gc_abt_destroy_args* gc_args = (struct gc_abt_destroy_args*)args;
      margo_abt_t* a = gc_args->abt;
      for (unsigned i = 0; i < a->xstreams_len; ++i) {
          __margo_abt_xstream_destroy(a->xstreams + i, a);
      }
      free(a->xstreams);
      for (unsigned i = 0; i < a->pools_len; ++i) {
          __margo_abt_pool_destroy(a->pools + i, a);
      }
      free(a->pools);
      free(a->profiling_dir);
      memset(a, 0, sizeof(*a));
      ABT_eventual_set(gc_args->ev, NULL, 0);
  }

  void __margo_abt_destroy(margo_abt_t* a)
  {
      /* dedicated pool and ES used only for garbage collection */
      ABT_pool gc_pool = ABT_POOL_NULL;
      ABT_pool_create_basic(ABT_POOL_FIFO, ABT_POOL_ACCESS_MPMC, ABT_TRUE, &gc_pool);

      ABT_xstream gc_es = ABT_XSTREAM_NULL;
      ABT_xstream_create_basic(ABT_SCHED_BASIC, 1, &gc_pool, ABT_SCHED_CONFIG_NULL,  &gc_es);

      struct gc_abt_destroy_args gc_args = {
          .abt = a,
          .ev = ABT_EVENTUAL_NULL
      };

      ABT_eventual_create(0, &gc_args.ev);

      /* push the GC ULT into the GC pool, then wait for it to complete */
      ABT_thread_create_to(gc_pool, gc_abt_destroy, &gc_args, ABT_THREAD_ATTR_NULL, NULL);

      ABT_eventual_wait(gc_args.ev, 0);
      ABT_eventual_free(&gc_args.ev);

      ABT_xstream_join(gc_es);
      ABT_xstream_free(&gc_es);

      if (--g_margo_num_instances == 0 && g_margo_abt_init) {
          /* shut down global abt profiling if needed */
          if (g_margo_abt_prof_init) {
              if (g_margo_abt_prof_started) {
                  ABTX_prof_stop(g_margo_abt_prof_context);
                  g_margo_abt_prof_started = 0;
              }
              ABTX_prof_finalize(g_margo_abt_prof_context);
              g_margo_abt_prof_init = 0;
          }
          ABT_finalize();
          g_margo_abt_init = false;
      }
  }

mdorier commented 1 month ago

I don't get it. I tried writing a pure Argobots reproducer of the original setup, in which a non-primary ES can "self-destruct", and it works fine: main creates 4 extra ES that all share the primary pool, I yield until main runs on one of these ES, and then I ABT_xstream_join and ABT_xstream_free each extra ES one by one. I can see the main ULT changing ES as the one it is running on gets freed, with no problem. The same pattern in Margo, however, triggers the ASAN error.
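
A rough sketch of that reproducer (not the exact code; error checking omitted):

  #include <stdio.h>
  #include <abt.h>

  #define NUM_EXTRA_ES 4

  int main(int argc, char** argv)
  {
      ABT_init(argc, argv);

      /* primary ES and its main pool */
      ABT_xstream primary_es;
      ABT_pool    primary_pool;
      ABT_self_get_xstream(&primary_es);
      ABT_xstream_get_main_pools(primary_es, 1, &primary_pool);

      /* extra ES that all pull work from the primary pool */
      ABT_xstream es[NUM_EXTRA_ES];
      for (int i = 0; i < NUM_EXTRA_ES; ++i)
          ABT_xstream_create_basic(ABT_SCHED_BASIC, 1, &primary_pool,
                                   ABT_SCHED_CONFIG_NULL, &es[i]);

      /* yield until the main ULT gets picked up by a non-primary ES */
      ABT_bool on_primary = ABT_TRUE;
      do {
          ABT_thread_yield();
          ABT_xstream current;
          ABT_self_get_xstream(&current);
          ABT_xstream_equal(current, primary_es, &on_primary);
      } while (on_primary == ABT_TRUE);

      /* join/free every extra ES, including (at some point) the one the
       * main ULT is currently running on */
      for (int i = 0; i < NUM_EXTRA_ES; ++i) {
          ABT_xstream current;
          int         rank = -1;
          ABT_xstream_join(es[i]);
          ABT_xstream_free(&es[i]);
          ABT_self_get_xstream(&current);
          ABT_xstream_get_rank(current, &rank);
          printf("freed extra ES %d; main ULT now on ES rank %d\n", i, rank);
      }

      ABT_finalize();
      return 0;
  }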

Note also that when ASAN is disabled everything works fine: even if margo_cleanup runs on a non-primary ES and tries to destroy it, the calling ULT has no problem moving to another ES and ultimately ends up running on the primary ES.

So my assumption is that it's an ASAN issue within Argobots, not a Margo issue.