mksully22 closed this issue 3 years ago.
verifying.
Verified. I hit the hang in Iteration4.
```
(gdb) info thread
Id Target Id Frame
9 Thread 0x10000265f1b0 (LWP 90430) "lt-opal_fifo" opal_atomic_rmb ()
    at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
8 Thread 0x10000224f1b0 (LWP 90431) "lt-opal_fifo" 0x00000000100017e0 in opal_fifo_pop_atomic (
    fifo=0x3fffff302510) at ../../opal/class/opal_fifo.h:137
7 Thread 0x100001e3f1b0 (LWP 90432) "lt-opal_fifo" 0x00000000100017e8 in opal_fifo_pop_atomic (
    fifo=0x3fffff302510) at ../../opal/class/opal_fifo.h:137
6 Thread 0x100001a2f1b0 (LWP 90433) "lt-opal_fifo" opal_read_counted_pointer (value=0x100001a2e5e0,
    addr=0x3fffff302540) at ../../opal/class/opal_lifo.h:83
5 Thread 0x10000161f1b0 (LWP 90434) "lt-opal_fifo" opal_atomic_rmb ()
    at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
4 Thread 0x10000120f1b0 (LWP 90435) "lt-opal_fifo" opal_fifo_pop_atomic (fifo=0x3fffff302510)
    at ../../opal/class/opal_fifo.h:138
3 Thread 0x100000dff1b0 (LWP 90436) "lt-opal_fifo" 0x0000000010001800 in opal_fifo_pop_atomic (
    fifo=0x3fffff302510) at ../../opal/class/opal_fifo.h:138
2 Thread 0x1000009ef1b0 (LWP 90437) "lt-opal_fifo" 0x00000000100017ec in opal_fifo_pop_atomic (
    fifo=0x3fffff302510) at ../../opal/class/opal_fifo.h:137
* 1 Thread 0x100000045570 (LWP 90404) "lt-opal_fifo" 0x00001000003ba0d8 in pthread_join ()
    from /lib64/libpthread.so.0
```
Tomorrow I'll debug.
I've verified that `--disable-builtin-atomics` works around the problem on power8.
Do you know if you can hit the problem on Power9?
I'll check. I hope that https://github.com/open-mpi/ompi/pull/5374 resolved this on Power9, but I didn't try the stress test.
I'll check on my Power9 system as well
Power9 has run over 140 iterations successfully. I did not disable the builtin atomics when testing on power9.
So, it looks like it's just a power8 issue. Power8 does detect this by default:
```
checking for __atomic builtin atomics... yes
checking for processor support of __atomic builtin atomic compare-and-swap on 128-bit values... yes
checking if __int128 atomic compare-and-swap is always lock-free... yes
```
I didn't have as much luck on my Power9 system. With `./configure --enable-debug --prefix=/usr --mandir=/usr/share/man --sysconfdir=/etc/openmpi --enable-ipv6 --with-threads=posix --with-hwloc=/usr` I still get hangs. The gdb output is below. Note: I repeated the stress test with `--disable-builtin-atomics` added to the configure line and could not reproduce the problem.
```
/tmp # gdb attach 107
GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-alpine-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
attach: No such file or directory.
Attaching to process 107
[New LWP 129]
[New LWP 130]
[New LWP 131]
[New LWP 132]
[New LWP 133]
[New LWP 134]
[New LWP 135]
[New LWP 136]
0x00007cd545b21fe0 in __clone () from /lib/ld-musl-powerpc64le.so.1
(gdb) info threads
Id Target Id Frame
* 1 LWP 107 "lt-opal_fifo" 0x00007cd545b21fe0 in __clone () from /lib/ld-musl-powerpc64le.so.1
2 LWP 129 "lt-opal_fifo" 0x00000f5f7aa11acc in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:137
3 LWP 130 "lt-opal_fifo" 0x00000f5f7aa11a74 in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:130
4 LWP 131 "lt-opal_fifo" opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:130
5 LWP 132 "lt-opal_fifo" 0x00000f5f7aa11adc in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:138
6 LWP 133 "lt-opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
7 LWP 134 "lt-opal_fifo" 0x00000f5f7aa11a70 in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:130
8 LWP 135 "lt-opal_fifo" 0x00000f5f7aa11a5c in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:126
9 LWP 136 "lt-opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
(gdb) thread apply all bt
Thread 9 (LWP 136):
#0 opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
#1 0x00000f5f7aa11b88 in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:159
#2 0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#3 0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#4 0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1
Thread 8 (LWP 135):
#0 0x00000f5f7aa11a5c in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:126
#1 0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#2 0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3 0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1
Thread 7 (LWP 134):
#0 0x00000f5f7aa11a70 in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:130
#1 0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#2 0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3 0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1
Thread 6 (LWP 133):
#0 opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
#1 0x00000f5f7aa11a64 in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:127
#2 0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#3 0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#4 0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1
Thread 5 (LWP 132):
#0 0x00000f5f7aa11adc in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:138
#1 0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#2 0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3 0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1
Thread 4 (LWP 131):
#0 opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:130
#1 0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#2 0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3 0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1
Thread 3 (LWP 130):
#0 0x00000f5f7aa11a74 in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:130
#1 0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#2 0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3 0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1
Thread 2 (LWP 129):
#0 0x00000f5f7aa11acc in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:137
#1 0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#2 0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3 0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1
Thread 1 (LWP 107):
#0 0x00007cd545b21fe0 in __clone () from /lib/ld-musl-powerpc64le.so.1
#1 0x00007cd545b138c0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#2 0x00007cd545b12374 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3 0x00007cd545b12478 in __timedwait_cp () from /lib/ld-musl-powerpc64le.so.1
Backtrace stopped: frame did not save the PC
(gdb) detach
Detaching from program: /tmp/ompi/test/class/.libs/lt-opal_fifo, process 107
(gdb) quit
```
```
/tmp # cat /proc/cpuinfo
processor : 0
cpu : POWER9, altivec supported
clock : 2300.000000MHz
revision : 2.2 (pvr 004e 1202)
processor : 1
cpu : POWER9, altivec supported
clock : 2300.000000MHz
revision : 2.2 (pvr 004e 1202)
processor : 2
cpu : POWER9, altivec supported
clock : 2300.000000MHz
revision : 2.2 (pvr 004e 1202)
processor : 3
cpu : POWER9, altivec supported
clock : 2300.000000MHz
revision : 2.2 (pvr 004e 1202)
processor : 4
cpu : POWER9, altivec supported
clock : 2300.000000MHz
revision : 2.2 (pvr 004e 1202)
```
(remaining processor entries trimmed)
So your power9 was hanging also? My power9 that seemed to work was RHEL 7.5 with gcc 4.8.5. It's also a rev 2.2.
Yes, power9 hung too. I tried it on Ubuntu 18.04, gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)
I also tried it in an Alpine container hosted on Ubuntu 18.04 and experienced the hang there too.
@mksully22 GitHub pro tip: use a line of 3 backticks to start and end verbatim sections on GitHub to make the text render better (see https://help.github.com/articles/creating-and-highlighting-code-blocks/).
@gpaulsen Any progress?
On the Webex today, @hjelmn says he'll go have a look at this. One possibility -- and @hjelmn will need to check with @bosilca -- is to disable the builtin atomics on v4.0.0.
Once we pull #5445 we should even be able to drop builtin atomics completely.
Removing critical since there is a workaround (disabling builtin atomics). I'll rerun the test on v4.0.x and see where it stands today.
@bosilca https://github.com/open-mpi/ompi/pull/5445 has been merged. Should we open an issue to track removing builtin atomics on master before v5.0.x?
This should be resolved by dropping back to the atomic assembly on POWER machines in these PRs to v4.1 and v5:
v4.1.x: #8708
v5.0.x: #8710
master: https://github.com/open-mpi/ompi/pull/8649 - has some performance numbers and discussion.
It looks like it is not going back to v4. However, users can compile with the configure option `--disable-builtin-atomics` to fix this on v4.0.x.
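For reference, a v4.0.x build using that workaround would look something like this (the prefix and parallelism are just example values; `--disable-builtin-atomics` is the relevant option):

```shell
# Build Open MPI without the compiler's __atomic builtins so it falls back
# to the hand-written powerpc assembly atomics, avoiding the opal_fifo hang.
./configure --disable-builtin-atomics --prefix=/usr
make -j8
make install
```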
See issue:
Background information
Running opal_fifo test intermittently hangs on Power8. Detailed debug info is provided below
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Using master branch commit level 92d89411ca0f6ae70d57270ffc5bd8b91d0992e7
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Please describe the system on which you are running
Details of the problem
Using the following script to exercise the opal_fifo test. The test case will hang intermittently. htop shows all 8 opal_fifo LWPs running at 100% CPU.
Start opal_fifo stress script to reproduce:
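The script itself did not survive this scrape; a hypothetical reproducer of the same shape would be the following (TEST_BIN is an assumption, to be pointed at the opal_fifo binary under ompi/test/class; the `true` stand-in just lets the sketch run as-is):

```shell
# Hypothetical stress loop: run the opal_fifo class test repeatedly and
# print the iteration count, so a hang shows up as stalled output.
TEST_BIN=${TEST_BIN:-true}   # stand-in; use ./opal_fifo for the real test
i=1
while [ "$i" -le 10 ]; do
    echo "Iteration$i"
    "$TEST_BIN" || break     # a hang shows up as this command never returning
    i=$((i + 1))
done
```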
Looking at the running processes/LWPs
Using gdb to collect some info on where the LWPs are:
Note: LWPs 1, 2, and 4-8 are all caught in this loop (I had gdb display some of the noteworthy variable values):
Note: Thread 3 is looping a bit farther down
Is there any additional information that I could collect that would be helpful to diagnose the issue?