ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

COMM_SPAWN: spawnees INIT may fail creating proc_t too early #37

Status: Open · abouteiller opened this issue 5 years ago

abouteiller commented 5 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


The issue is rather rare at moderate scale, but it could become a blocker at large scale.

  1. The parent MPI program experiences a process failure.
  2. The parent MPI program issues a SHRINK-SPAWN recovery sequence (sketched after this list).
  3. The spawnees enter MPI_INIT.
  4. The spawnees receive a PMIx notification about the process that failed in step 1.
  5. The spawnees create the proc_t before MPI_INIT is ready for it.
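
The parent-side sequence (steps 1 and 2) looks roughly like the function sketch below. This is only a sketch, not a complete program: the ULFM prototypes are assumed to come from <mpi-ext.h>, the spawn command name "spawnee" and the replacement count of 1 are placeholders, and all error handling is elided.

#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_Comm_shrink and friends (ULFM extensions) */

/* Recover from a detected process failure on *comm: shrink away the dead
 * processes, then spawn replacements.  The spawnees run MPI_INIT (step 3)
 * and may receive the PMIx notification from step 1 while still inside it. */
void shrink_spawn_recover(MPI_Comm *comm)
{
    MPI_Comm shrunk, children;

    MPIX_Comm_shrink(*comm, &shrunk);            /* drop failed processes */
    MPI_Comm_spawn("spawnee", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, shrunk, &children, MPI_ERRCODES_IGNORE);

    MPI_Comm_free(comm);
    *comm = shrunk;                              /* simplified bookkeeping */
    (void)children;                              /* merging is elided here */
}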

The symptom is that the OBJ_RETAIN on the local convertor happens before that convertor has been initialized:

#5  0x00007ffff774f6b0 in ompi_proc_construct (proc=0x6c6a90) at ../../src/ompi/proc/proc.c:80
80          OBJ_RETAIN( ompi_mpi_local_convertor );
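
The hazard is not specific to the OMPI machinery; condensed to a standalone illustration (none of this is OMPI code), it is simply a constructor taking a reference on a global singleton that initialization has not produced yet:

#include <stdlib.h>

/* Stand-in for an OBJ-style reference-counted object. */
typedef struct {
    int obj_reference_count;
} convertor_t;

/* Stand-in for ompi_mpi_local_convertor: only valid once init has run. */
static convertor_t *local_convertor = NULL;

/* Stand-in for ompi_proc_construct: retains the singleton unconditionally. */
static void proc_construct(void)
{
    /* Equivalent of OBJ_RETAIN(ompi_mpi_local_convertor): dereferences a
     * NULL (or half-built) pointer when it runs before initialization. */
    local_convertor->obj_reference_count++;
}

int main(void)
{
    proc_construct();                                       /* too early: SIGSEGV */
    local_convertor = calloc(1, sizeof(*local_convertor));  /* too late */
    free(local_convertor);
    return 0;
}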

The root cause could be that ompi_interlib_declare calls opal_progress too early in MPI_INIT in order to perform some LAZY_COMPLETION of PMIx/ORTE operations; that progress call drains the event queue and lets the error-handler callback build the proc_t before the convertor machinery exists (see the backtrace below, and the sketch after it):

#4  <signal handler called>
#5  0x00007ffff774f6b0 in ompi_proc_construct (proc=0x6c6a90) at ../../src/ompi/proc/proc.c:80
#6  0x00007ffff774efe9 in opal_obj_run_constructors (object=0x6c6a90) at ../../src/opal/class/opal_object.h:440
#7  0x00007ffff774f100 in opal_obj_new (cls=0x7ffff7ab5b60 <ompi_proc_t_class>) at ../../src/opal/class/opal_object.h:494
#8  0x00007ffff774ef5e in opal_obj_new_debug (type=0x7ffff7ab5b60 <ompi_proc_t_class>, file=0x7ffff7853114 "../../src/ompi/proc/proc.c", line=113) at ../../src/opal/class/opal_object.h:263
#9  0x00007ffff774f915 in ompi_proc_allocate (jobid=2680356865, vpid=21, procp=0x7fffffff9f40) at ../../src/ompi/proc/proc.c:113
#10 0x00007ffff774fef2 in ompi_proc_for_name_nolock (proc_name=...) at ../../src/ompi/proc/proc.c:213
#11 0x00007ffff774ff85 in ompi_proc_for_name (proc_name=...) at ../../src/ompi/proc/proc.c:240
#12 0x00007ffff773a295 in ompi_errhandler_event_cb (fd=-1, flags=2, context=0x7fffec000ed0) at ../../src/ompi/errhandler/errhandler.c:382
#13 0x00007ffff6b3db2c in event_process_active_single_queue (activeq=0x64f3a0, base=0x64ede0) at ../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1370
#14 event_process_active (base=<optimized out>) at ../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1440
#15 opal_libevent2022_event_base_loop (base=0x64ede0, flags=3) at ../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1644
#16 0x00007ffff6ad3423 in opal_progress_events () at ../../src/opal/runtime/opal_progress.c:191
#17 0x00007ffff6ad34dc in opal_progress () at ../../src/opal/runtime/opal_progress.c:247
#18 0x00007ffff774c904 in ompi_interlib_declare (threadlevel=0, version=0x7ffff7854d00 "4.1.0ft-ulfm-a1") at ../../src/ompi/interlib/interlib.c:122
#19 0x00007ffff7757469 in ompi_mpi_init (argc=1, argv=0x7fffffffa548, requested=0, provided=0x7fffffffa3ec, reinit_ok=false) at ../../src/ompi/runtime/ompi_mpi_init.c:557
#20 0x00007ffff77a6063 in PMPI_Init (argc=0x7fffffffa41c, argv=0x7fffffffa410) at pinit.c:67
#21 0x000000000040122f in main (argc=1, argv=0x7fffffffa548) at revshrinkkillrecover.c:91
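
One possible direction, shown here only as a hedged standalone sketch (the names failure_event_cb, replay_pending_failures and init_completed are hypothetical, not OMPI symbols), is to queue failure notifications that arrive before MPI_INIT has finished building the proc/convertor machinery and replay them once initialization completes:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* A process failure notification: jobid/vpid identify the failed process. */
typedef struct pending_failure {
    uint32_t jobid;
    uint32_t vpid;
    struct pending_failure *next;
} pending_failure_t;

static bool init_completed = false;          /* flipped at the end of init */
static pending_failure_t *pending_head = NULL;

/* Safe path: the proc_t (and the local convertor it retains) can be built. */
static void handle_proc_failure(uint32_t jobid, uint32_t vpid)
{
    (void)jobid;
    (void)vpid;
}

/* Analogous to ompi_errhandler_event_cb in the backtrace above. */
static void failure_event_cb(uint32_t jobid, uint32_t vpid)
{
    if (!init_completed) {
        /* Too early: remember the notification instead of building proc_t. */
        pending_failure_t *p = malloc(sizeof(*p));
        if (NULL == p) {
            abort();
        }
        p->jobid = jobid;
        p->vpid = vpid;
        p->next = pending_head;
        pending_head = p;
        return;
    }
    handle_proc_failure(jobid, vpid);
}

/* Called once MPI_INIT has set up the convertor/proc machinery. */
static void replay_pending_failures(void)
{
    init_completed = true;
    while (NULL != pending_head) {
        pending_failure_t *p = pending_head;
        pending_head = p->next;
        handle_proc_failure(p->jobid, p->vpid);
        free(p);
    }
}

int main(void)
{
    failure_event_cb(2680356865u, 21);   /* arrives while init is in progress */
    replay_pending_failures();           /* end of init: now it is safe */
    return 0;
}

The same effect could also be obtained by not progressing the error-handler event base until initialization is far enough along; either way, the notification from step 4 must not construct the proc_t before MPI_INIT is ready for it.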