open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Hang in mca_mpool_hugepage_module_init() on ARM64 #3697

Status: Open. Opened by yosefe 7 years ago.

yosefe commented 7 years ago

Background information

               Package: Open MPI mtt@hpc-arm-01 Distribution
                Open MPI: 2.1.2a1
  Open MPI repo revision: v2.1.1-55-g4d82554
   Open MPI release date: Unreleased developer copy
                Open RTE: 2.1.2a1
  Open RTE repo revision: v2.1.1-55-g4d82554
   Open RTE release date: Unreleased developer copy
                    OPAL: 2.1.2a1
      OPAL repo revision: v2.1.1-55-g4d82554
       OPAL release date: Unreleased developer copy

Details of the problem

A hang occurs in the ctxalloc test during MPI_Init; a similar hang is observed in many other tests. ctxalloc can be found here

mpirun -np 192 -mca btl_openib_warn_default_gid_prefix 0 \
  --bind-to core -mca pml ucx \
  -x UCX_NET_DEVICES=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 \
  -x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 \
   -mca mpi_add_procs_cutoff 100000 --map-by node \
ctxalloc 2 1500 100

Stack trace:

Thread 1 (Thread 0x3ffb800a480 (LWP 27563)):
#0  0x000003ffb7ccfa74 in nanosleep () from /usr/lib64/libpthread.so.0
#1  0x000003ffb7911e0c in _opal_lifo_release_cpu () at ../opal/class/opal_lifo.h:195
#2  0x000003ffb7911e40 in opal_lifo_pop_atomic (lifo=0x620520) at ../opal/class/opal_lifo.h:210
#3  0x000003ffb7911fd0 in opal_free_list_get_st (flist=0x620520) at ../opal/class/opal_free_list.h:213
#4  0x000003ffb7911ff4 in opal_free_list_get (flist=0x620520) at ../opal/class/opal_free_list.h:225
#5  0x000003ffb7912264 in opal_rb_tree_init (tree=0x6204e0, comp=0x3ffb4671ae4 <mca_mpool_rb_hugepage_compare>) at class/opal_rb_tree.c:86
#6  0x000003ffb4671d44 in mca_mpool_hugepage_module_init (mpool=0x620410, huge_page=0x61ef30) at mpool_hugepage_module.c:107
#7  0x000003ffb4672a10 in mca_mpool_hugepage_open () at mpool_hugepage_component.c:166
#8  0x000003ffb7948ab4 in open_components (framework=0x3ffb7a34788 <opal_mpool_base_framework>) at mca_base_components_open.c:117
#9  0x000003ffb79489e4 in mca_base_framework_components_open (framework=0x3ffb7a34788 <opal_mpool_base_framework>, flags=MCA_BASE_OPEN_DEFAULT) at mca_base_components_open.c:65
#10 0x000003ffb79bd2a0 in mca_mpool_base_open (flags=MCA_BASE_OPEN_DEFAULT) at base/mpool_base_frame.c:89
#11 0x000003ffb7957d1c in mca_base_framework_open (framework=0x3ffb7a34788 <opal_mpool_base_framework>, flags=MCA_BASE_OPEN_DEFAULT) at mca_base_framework.c:174
#12 0x000003ffb7d59bb8 in ompi_mpi_init (argc=4, argv=0x3ffffffdc38, requested=0, provided=0x3ffffffda2c) at runtime/ompi_mpi_init.c:589
#13 0x000003ffb7d990e0 in PMPI_Init (argc=0x3ffffffdaac, argv=0x3ffffffdaa0) at pinit.c:66
#14 0x0000000000400c5c in main (argc=4, argv=0x3ffffffdc38) at ctxalloc.c:20
shamisp commented 7 years ago

Does it hang forever in nanosleep()?


yosefe commented 7 years ago

Unlikely, this is a fixed 100 ns interval:

static inline void _opal_lifo_release_cpu (void)
{
    /* NTH: there are many ways to cause the current thread to be suspended. This one
     * should work well in most cases. Another approach would be to use poll (NULL, 0, ) but
     * the interval will be forced to be in ms (instead of ns or us). Note that there
     * is a performance improvement for the lifo test when this call is made on detection
     * of contention but it may not translate into actually MPI or application performance
     * improvements. */
    static struct timespec interval = { .tv_sec = 0, .tv_nsec = 100 };
    nanosleep (&interval, NULL);
}
jsquyres commented 7 years ago

Can you provide more information about what exactly is hanging? I.e., this is clearly one stack trace of the hang, but it must be looping on something that never completes.

shamisp commented 7 years ago

@yosefe can I reproduce this with a single node? Do I really need IB for this? Thanks

bosilca commented 7 years ago

You don't need IB for this; just force the load of the hugepage mpool. Also, this seems to happen in a very peculiar place, during a lifo_pop operation, which in this particular context uses the conditional load/store of ARM64. As our tests pass on ARM64, I don't think the issue is in the atomic itself, but rather in the way the rb tree is initialized or how the freelist is grown in this particular instance.

shamisp commented 7 years ago

Is this a multi-threaded run? How many threads?

shamisp commented 7 years ago

@yosefe, can you please replace https://github.com/open-mpi/ompi/blob/d7ebcca93fc86796e0d43997935539b1e922a1f1/opal/class/opal_lifo.h#L226 with opal_atomic_mb()?

yosefe commented 7 years ago

@shamisp it's looping in https://github.com/open-mpi/ompi/blob/d7ebcca93fc86796e0d43997935539b1e922a1f1/opal/class/opal_lifo.h#L224; changing to opal_atomic_mb() does not help. It can be reproduced on 1 node. Disabling the hugepage mpool moves the hang to another place, so it's something more fundamental:

(gdb) bt
#0  0x000003ffb7ccfa78 in nanosleep () from /usr/lib64/libpthread.so.0
#1  0x000003ffb7911e0c in _opal_lifo_release_cpu () at ../opal/class/opal_lifo.h:195
#2  0x000003ffb7911e40 in opal_lifo_pop_atomic (lifo=0x3ffb7a3b778 <mca_mpool_base_tree+64>) at ../opal/class/opal_lifo.h:210
#3  0x000003ffb7911fd0 in opal_free_list_get_st (flist=0x3ffb7a3b778 <mca_mpool_base_tree+64>) at ../opal/class/opal_free_list.h:213
#4  0x000003ffb7911ff4 in opal_free_list_get (flist=0x3ffb7a3b778 <mca_mpool_base_tree+64>) at ../opal/class/opal_free_list.h:225
#5  0x000003ffb7912264 in opal_rb_tree_init (tree=0x3ffb7a3b738 <mca_mpool_base_tree>, comp=0x3ffb79bdba0 <mca_mpool_base_tree_node_compare>) at class/opal_rb_tree.c:86
#6  0x000003ffb79bde28 in mca_mpool_base_tree_init () at base/mpool_base_tree.c:87
#7  0x000003ffb79bd378 in mca_mpool_base_open (flags=MCA_BASE_OPEN_DEFAULT) at base/mpool_base_frame.c:102
#8  0x000003ffb7957d1c in mca_base_framework_open (framework=0x3ffb7a34788 <opal_mpool_base_framework>, flags=MCA_BASE_OPEN_DEFAULT) at mca_base_framework.c:174
#9  0x000003ffb7d59bc0 in ompi_mpi_init (argc=4, argv=0x3ffffffe5c8, requested=0, provided=0x3ffffffe3bc) at runtime/ompi_mpi_init.c:589
#10 0x000003ffb7d990e8 in PMPI_Init (argc=0x3ffffffe43c, argv=0x3ffffffe430) at pinit.c:66
#11 0x0000000000400c5c in main (argc=4, argv=0x3ffffffe5c8) at ctxalloc.c:20
yosefe commented 7 years ago

command line on single node:

mpirun -np 1 -mca btl self --bind-to core --display-map -mca pml ob1 ./ctxalloc 2 1500 100
kawashima-fj commented 7 years ago

I can reproduce the hang on my ARM64 machine.

Open MPI: openmpi-v2.x-201706150321-b562082 (nightly snapshot tarball)
CPU: Cavium ThunderX (ARMv8, 64-bit)
OS: CentOS Linux release 7.2.1603 (AltArch)
Compiler: GCC 4.8.5 (in CentOS)
configure option: --enable-debug or --disable-debug
compiler option: -O0

I cannot reproduce the hang with -O1 or -O2 (either --enable-debug or --disable-debug).

Test programs in the Open MPI source tree also hang.

test/class/opal_lifo

(gdb) bt
#0  0x0000ffffb7d009a8 in __nanosleep_nocancel () from /lib64/libpthread.so.0
#1  0x0000000000401384 in _opal_lifo_release_cpu ()
    at ../../../../opal/class/opal_lifo.h:195
#2  0x00000000004013b8 in opal_lifo_pop_atomic (lifo=0xffffffffed80)
    at ../../../../opal/class/opal_lifo.h:210
#3  0x00000000004014f4 in thread_test (arg=0xffffffffed80)
    at ../../../../test/class/opal_lifo.c:50
#4  0x0000000000401ab8 in main (argc=1, argv=0xfffffffff018)
    at ../../../../test/class/opal_lifo.c:147

test/class/opal_fifo

(gdb) bt
#0  opal_fifo_pop_atomic (fifo=0xffffffffed68)
    at ../../../../opal/class/opal_fifo.h:238
#1  0x00000000004015d4 in thread_test (arg=0xffffffffed68)
    at ../../../../test/class/opal_fifo.c:51
#2  0x0000000000401e3c in main (argc=1, argv=0xfffffffff018)
    at ../../../../test/class/opal_fifo.c:184

test/class/ompi_rb_tree

(gdb) bt
#0  0x0000ffffb7a589a8 in __nanosleep_nocancel () from /lib64/libpthread.so.0
#1  0x0000ffffb7bfb338 in _opal_lifo_release_cpu ()
    at ../../../opal/class/opal_lifo.h:195
#2  0x0000ffffb7bfb36c in opal_lifo_pop_atomic (lifo=0xffffffffec90)
    at ../../../opal/class/opal_lifo.h:210
#3  0x0000ffffb7bfb4fc in opal_free_list_get_st (flist=0xffffffffec90)
    at ../../../opal/class/opal_free_list.h:213
#4  0x0000ffffb7bfb520 in opal_free_list_get (flist=0xffffffffec90)
    at ../../../opal/class/opal_free_list.h:225
#5  0x0000ffffb7bfb790 in opal_rb_tree_init (tree=0xffffffffec50, comp=0x4016c4 <comp_fn>)
    at ../../../opal/class/opal_rb_tree.c:86
#6  0x0000000000401c04 in test1 ()
    at ../../../../test/class/ompi_rb_tree.c:145
#7  0x0000000000402850 in main (argc=1, argv=0xfffffffff018)
    at ../../../../test/class/ompi_rb_tree.c:408

In the opal_lifo case, it is looping in the do-while loop of the opal_lifo_pop_atomic function (the OPAL_HAVE_ATOMIC_LLSC_PTR == 1 case).

In the opal_fifo case, it is looping in the do-while loop of the opal_fifo_pop_atomic function (the OPAL_HAVE_ATOMIC_LLSC_PTR == 1 case), not the continue case.
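
For reference, the loop in question has roughly this shape (a paraphrase of opal/class/opal_lifo.h from that era, not the verbatim source; details such as the attempt counter may differ). The reported hang is this do-while never seeing a successful store-conditional:

/* Paraphrase of the OPAL_HAVE_ATOMIC_LLSC_PTR == 1 pop path.
 * opal_atomic_ll_ptr()/opal_atomic_sc_ptr() map to the AArch64
 * ldaxr/stlxr instructions on this platform. */
static inline opal_list_item_t *opal_lifo_pop_atomic (opal_lifo_t *lifo)
{
    opal_list_item_t *item;
    int attempt = 0;

    do {
        if (++attempt == 5) {
            /* on apparent contention, briefly suspend this thread;
             * this is the nanosleep() seen at the top of the traces */
            _opal_lifo_release_cpu ();
            attempt = 0;
        }

        /* load-linked the head pointer; sets the exclusive reservation */
        item = (opal_list_item_t *) opal_atomic_ll_ptr (&lifo->opal_lifo_head.data.item);
        if (&lifo->opal_lifo_ghost == item) {
            return NULL;    /* lifo is empty */
        }

        /* store-conditional the new head; fails, and we retry, whenever
         * the reservation was lost between the LL and the SC */
    } while (!opal_atomic_sc_ptr (&lifo->opal_lifo_head.data.item, item->opal_list_next));

    return item;
}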

hjelmn commented 7 years ago

Does it run with --disable-builtin-atomics?

kawashima-fj commented 7 years ago

@hjelmn Builtin atomics are disabled by default in the v2.x branch. I enabled BUILTIN_GCC (__atomic_*) with --enable-builtin-atomics; then make check (which includes the three test programs above) passed without hanging. Both -O0 and -O2 were tested.

shamisp commented 7 years ago

@yosefe - what compiler version is used?

yosefe commented 7 years ago

$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
kawashima-fj commented 7 years ago

@PHHargrove reported similar issues on the devel list.

Regarding make check, the reported version/platform/compiler/test patterns of failures/hangs are:

A common point of ARM64 and PPC64(LE) is OPAL_HAVE_ATOMIC_LLSC_PTR == 1 (as grep -r OPAL_HAVE_ATOMIC_LLSC_ opal/include/opal/sys shows). But I'm wondering why the following patterns succeed. ARM64 and PPC64 show opposite results regarding --enable-builtin-atomics.

--disable-builtin-atomics is default in Open MPI 2.1 series. --enable-builtin-atomics is default in Open MPI 3.0 series.

jsquyres commented 7 years ago

Added a v3.0.0 milestone since @PHHargrove saw this on 3.0.0rc1, per the above comment.

shamisp commented 7 years ago

I will try to reproduce it on one of my systems.

shamisp commented 7 years ago

I can confirm that the problem only shows up in -O0 mode.

hjelmn commented 7 years ago

I know what is happening. With -O0 the opal atomic functions are not inlined, which makes each LL/SC atomic a function call. This will likely cause livelock with the LL/SC fifo/lifo implementations, as it increases the chance that a read will cancel the LL reservation. The correct fix is to force those atomics to always be inlined. I will make the change and see if it fixes the issue.
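
A minimal sketch of that fix direction, assuming GCC on AArch64 (ll_ptr/sc_ptr below are illustrative stand-ins for the real opal_atomic_ll_ptr/opal_atomic_sc_ptr macros, not the actual patch). __attribute__((always_inline)) keeps the LL/SC pair in the caller even at -O0, so no call/return memory traffic lands between the load-exclusive and the store-exclusive:

#include <stdint.h>

/* At -O0 a plain "static inline" is not honored, so each atomic becomes
 * a real function call; per the diagnosis above, the extra memory
 * accesses between ldaxr and stlxr can cancel the exclusive reservation,
 * so the SC keeps failing and the pop loop livelocks. */
static inline __attribute__((always_inline)) intptr_t
ll_ptr (volatile intptr_t *addr)
{
    intptr_t ret;
    /* AArch64 load-exclusive: sets the exclusive monitor for addr */
    __asm__ __volatile__ ("ldaxr %0, [%1]"
                          : "=&r" (ret)
                          : "r" (addr)
                          : "memory");
    return ret;
}

static inline __attribute__((always_inline)) int
sc_ptr (volatile intptr_t *addr, intptr_t value)
{
    int ret;
    /* AArch64 store-exclusive: writes only if the monitor is still set;
     * the status register is 0 on success */
    __asm__ __volatile__ ("stlxr %w0, %2, [%1]"
                          : "=&r" (ret)
                          : "r" (addr), "r" (value)
                          : "memory");
    return ret == 0;
}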

kawashima-fj commented 7 years ago

@hjelmn I confirmed the current status on AArch64.

make check result:

version         enable + -O0   enable + -O3   disable + -O0   disable + -O3
2.0.3           OK             OK             OK              OK
2.1.1           OK             OK             hang-up         OK
3.0.0rc1        OK             OK             hang-up         OK
2.0.x nightly   OK             OK             OK              OK
2.1.x nightly   OK             OK             hang-up         OK
3.0.x nightly   OK             OK             hang-up         OK

(enable/disable refers to --enable-builtin-atomics / --disable-builtin-atomics.)

All hang-up occur in the test/class directory (opal_fifo etc.).

As you said, the bad combination is: Open MPI 2.1 or higher + --disable-builtin-atomics + -O0.

Open MPI 2.0.x does not have this issue because it lacks the opal/include/opal/sys/arm64 code, so the LL/SC fifo/lifo implementations are not used.

I'll confirm the performance difference between --enable-builtin-atomics and --disable-builtin-atomics.

kawashima-fj commented 7 years ago

@hjelmn I ran osu_latency and osu_latency_mt on my ARM machine and found --disable-builtin-atomics is slightly faster than --enable-builtin-atomics on osu_latency_mt.

osu_latency latency (us):

size (byte)   --enable-builtin-atomics   --disable-builtin-atomics
0             0.88                       0.87
1             1.07                       1.08
2             1.08                       1.08
4             1.09                       1.08
8             1.11                       1.11
16            1.13                       1.13
32            1.13                       1.13
64            1.16                       1.15

osu_latency_mt latency (us):

size (byte)   --enable-builtin-atomics   --disable-builtin-atomics
0             4.45                       4.44
1             4.73                       4.70
2             4.73                       4.70
4             4.74                       4.71
8             4.76                       4.74
16            4.78                       4.76
32            4.79                       4.78
64            4.83                       4.80

Each value is the median of 10 runs.

If you need more data, let me know.

kawashima-fj commented 7 years ago

@hjelmn At the f2f meeting you asked me which commits I backported into the v2.0.2-based Fujitsu MPI. I backported the following commits. Open MPI 2.0 can run on AArch64 without these commits, but I backported them for better AArch64 support.

shamisp commented 7 years ago

I have tried gcc 6.1.0 and gcc 7.1.0 and I still observe the same issue.

hjelmn commented 7 years ago

@kawashima-fj #3988 should fix the hang.

It's no surprise that the built-in atomics version is slower; the LL/SC lifo is significantly faster than the compare-and-swap version.
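
For contrast, here is a generic sketch of the compare-and-swap style of pop that the built-in-atomics build falls back to (illustrative only, using the GCC __atomic builtins; the real opal_lifo CAS path additionally guards against ABA with a counter alongside the head pointer, e.g. via a 128-bit CAS where available, which adds to the cost):

#include <stdbool.h>
#include <stddef.h>

typedef struct node {
    struct node *next;
} node_t;

/* Pop the head of a lock-free stack with a CAS retry loop. */
static node_t *cas_pop (node_t **head)
{
    node_t *item = __atomic_load_n (head, __ATOMIC_ACQUIRE);

    while (NULL != item &&
           !__atomic_compare_exchange_n (head, &item, item->next,
                                         false, __ATOMIC_ACQ_REL,
                                         __ATOMIC_ACQUIRE)) {
        /* a failed CAS reloads the current head into item; retry.
         * The unguarded read of item->next here is exactly the ABA /
         * use-after-free hazard that the real code's counter mitigates. */
    }

    return item;
}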

hjelmn commented 7 years ago

This is the correct one. I think we have a fix.

bwbarrett commented 7 years ago

@hjelmn / @jjhursey, where are we on fixing this issue? What branches are still impacted?

jsquyres commented 6 years ago

Per the 2018-03 Dallas face-to-face meeting, this is still happening for Fujitsu on ARMv8. It was discussed at the Dallas meeting; @hjelmn is looking into this.

shamisp commented 6 years ago

Can we have a bit more detail on this? Which AMO is broken? Does it happen only with built-in AMOs?