open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Running opal_fifo test intermittently hangs on Power8 #5470

Closed (mksully22 closed this issue 3 years ago)

mksully22 commented 6 years ago


Background information

Running the opal_fifo test intermittently hangs on Power8. Detailed debug information is provided below.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Master branch at commit 92d89411ca0f6ae70d57270ffc5bd8b91d0992e7

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone https://github.com/open-mpi/ompi.git
cd ompi
./autogen.pl
./configure --enable-debug --prefix=/usr --mandir=/usr/share/man --sysconfdir=/etc/$pkgname --enable-ipv6 --with-threads=posix --with-hwloc=/usr
make
make check

Please describe the system on which you are running


Details of the problem

The following script exercises opal_fifo in a loop. The test case hangs intermittently; when it does, htop shows all 8 opal_fifo worker LWPs running at 100% CPU.

Start the opal_fifo stress script to reproduce:

#!/bin/bash
i=0
while :
do
        echo "Iteration: $i"
        ./test/class/opal_fifo
        ((i++))
        sleep 1
done
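
Because the hang is intermittent, it can help to let the loop stop by itself when an iteration stalls. The variant below is only a sketch: it assumes GNU coreutils `timeout` is installed, and the function name `stress_until_hang`, the time limit, and the iteration cap are illustrative choices that are not part of the original reproduction. The `sleep 1` between iterations is left out for brevity.

```shell
#!/bin/bash
# Sketch: rerun a command until one iteration exceeds a time limit.
# Assumes GNU coreutils `timeout`; stress_until_hang is a hypothetical helper.
#   $1 = time limit in seconds, $2 = maximum number of iterations,
#   remaining arguments = command to run.
stress_until_hang() {
    local limit=$1 max_iter=$2
    shift 2
    local i=0 rc
    while [ "$i" -lt "$max_iter" ]; do
        echo "Iteration: $i"
        timeout "$limit" "$@"
        rc=$?
        # timeout exits with status 124 when it had to terminate the command
        if [ "$rc" -eq 124 ]; then
            echo "Iteration $i exceeded ${limit}s (possible hang)"
            return 124
        fi
        i=$((i + 1))
    done
    return 0
}
```

A call such as `stress_until_hang 60 1000 ./test/class/opal_fifo` would stop at the first iteration that runs longer than 60 seconds. Note that `timeout` terminates the stalled process, so this variant is suited to counting how often the hang occurs; attaching gdb to a hung process still requires the original loop.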

Looking at the running processes/LWPs:

root@p82qvirt:/home/mksully# ps -eLf | grep fifo
mksully   27439  21680  27439  0    9 16:20 pts/0    00:00:00 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27466 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27467 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27468 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27469 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27470 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27471 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27472 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
mksully   27439  21680  27473 99    9 16:20 pts/0    00:01:15 /tmp/ompi/test/class/.libs/opal_fifo
root      27648  50987  27648  0    1 16:21 pts/3    00:00:00 grep --color=auto fifo

Using gdb to collect some info on where the LWPs are:

root@p82qvirt:/home/mksully# gdb attach 27439
GNU gdb (Ubuntu 8.0.1-0ubuntu1) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
attach: No such file or directory.
Attaching to process 27439
[New LWP 27466]
[New LWP 27467]
[New LWP 27468]
[New LWP 27469]
[New LWP 27470]
[New LWP 27471]
[New LWP 27472]
[New LWP 27473]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
0x0000796665aa9db4 in __pthread_join (threadid=133480637264256, thread_return=0x7fffeeb23ed8) at pthread_join.c:90
90      pthread_join.c: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x796665cb55c0 (LWP 27439) "opal_fifo" 0x0000796665aa9db4 in __pthread_join (threadid=133480637264256, thread_return=0x7fffeeb23ed8)
    at pthread_join.c:90
  2    Thread 0x796661f7f180 (LWP 27466) "opal_fifo" opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
  3    Thread 0x79666277f180 (LWP 27467) "opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:59
  4    Thread 0x796662f7f180 (LWP 27468) "opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
  5    Thread 0x79666377f180 (LWP 27469) "opal_fifo" opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
  6    Thread 0x796663f7f180 (LWP 27470) "opal_fifo" 0x000007e12cbf1a34 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40)
    at ../../opal/class/opal_fifo.h:137
  7    Thread 0x79666577f180 (LWP 27471) "opal_fifo" opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
  8    Thread 0x796664f7f180 (LWP 27472) "opal_fifo" 0x000007e12cbf1a30 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40)
    at ../../opal/class/opal_fifo.h:137
  9    Thread 0x79666477f180 (LWP 27473) "opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
(gdb) thread apply all bt

Thread 9 (Thread 0x79666477f180 (LWP 27473)):
#0  opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
#1  0x000007e12cbf19d8 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:127
#2  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#3  0x0000796665aa8710 in start_thread (arg=0x79666477f180) at pthread_create.c:465
#4  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 8 (Thread 0x796664f7f180 (LWP 27472)):
#0  0x000007e12cbf1a30 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:137
#1  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#2  0x0000796665aa8710 in start_thread (arg=0x796664f7f180) at pthread_create.c:465
#3  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 7 (Thread 0x79666577f180 (LWP 27471)):
#0  opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
#1  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#2  0x0000796665aa8710 in start_thread (arg=0x79666577f180) at pthread_create.c:465
#3  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 6 (Thread 0x796663f7f180 (LWP 27470)):
#0  0x000007e12cbf1a34 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:137
#1  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#2  0x0000796665aa8710 in start_thread (arg=0x796663f7f180) at pthread_create.c:465
#3  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 5 (Thread 0x79666377f180 (LWP 27469)):
#0  opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
#1  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#2  0x0000796665aa8710 in start_thread (arg=0x79666377f180) at pthread_create.c:465
#3  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 4 (Thread 0x796662f7f180 (LWP 27468)):
#0  opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
#1  0x000007e12cbf1a6c in opal_read_counted_pointer (value=0x796662f7e590, addr=0x7fffeeb23f70) at ../../opal/class/opal_lifo.h:82
#2  opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:138
#3  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#4  0x0000796665aa8710 in start_thread (arg=0x796662f7f180) at pthread_create.c:465
#5  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 3 (Thread 0x79666277f180 (LWP 27467)):
#0  opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:59
#1  0x000007e12cbf1afc in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:159
#2  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#3  0x0000796665aa8710 in start_thread (arg=0x79666277f180) at pthread_create.c:465
#4  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 2 (Thread 0x796661f7f180 (LWP 27466)):
#0  opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
#1  0x000007e12cbf1f2c in thread_test_exhaust (arg=0x7fffeeb23f40) at opal_fifo.c:80
#2  0x0000796665aa8710 in start_thread (arg=0x796661f7f180) at pthread_create.c:465
#3  0x00007966659e35a0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

Thread 1 (Thread 0x796665cb55c0 (LWP 27439)):
#0  0x0000796665aa9db4 in __pthread_join (threadid=133480637264256, thread_return=0x7fffeeb23ed8) at pthread_join.c:90
#1  0x000007e12cbf2964 in main (argc=1, argv=0x7fffeeb24448) at opal_fifo.c:227
(gdb)

(gdb) disassemble /s opal_atomic_rmb
Dump of assembler code for function opal_atomic_rmb:
../../opal/include/opal/sys/gcc_builtin/atomic.h:
59      {
   0x000007e12cbf115c <+0>:     std     r31,-8(r1)
   0x000007e12cbf1160 <+4>:     stdu    r1,-48(r1)
   0x000007e12cbf1164 <+8>:     mr      r31,r1

60          __atomic_thread_fence (__ATOMIC_ACQUIRE);
   0x000007e12cbf1168 <+12>:    lwsync

61      }
   0x000007e12cbf116c <+16>:    nop
   0x000007e12cbf1170 <+20>:    addi    r1,r31,48
   0x000007e12cbf1174 <+24>:    ld      r31,-8(r1)
   0x000007e12cbf1178 <+28>:    blr
   0x000007e12cbf117c <+32>:    .long 0x0
   0x000007e12cbf1180 <+36>:    .long 0x0
   0x000007e12cbf1184 <+40>:    .long 0x1000180
End of assembler dump.
(gdb) disassemble /s opal_fifo_pop_atomic
Dump of assembler code for function opal_fifo_pop_atomic:
../../opal/class/opal_fifo.h:
119     {
   0x000007e12cbf1948 <+0>:     addis   r2,r12,2
   0x000007e12cbf194c <+4>:     addi    r2,r2,26040
   0x000007e12cbf1950 <+8>:     mflr    r0
   0x000007e12cbf1954 <+12>:    std     r0,16(r1)
   0x000007e12cbf1958 <+16>:    std     r31,-8(r1)
   0x000007e12cbf195c <+20>:    stdu    r1,-176(r1)
   0x000007e12cbf1960 <+24>:    mr      r31,r1
   0x000007e12cbf1964 <+28>:    std     r3,40(r31)
   0x000007e12cbf1968 <+32>:    ld      r9,-28688(r13)
   0x000007e12cbf196c <+36>:    std     r9,152(r31)
   0x000007e12cbf1970 <+40>:    li      r9,0

120         opal_list_item_t *item, *next, *ghost = &fifo->opal_fifo_ghost;
   0x000007e12cbf1974 <+44>:    ld      r9,40(r31)
   0x000007e12cbf1978 <+48>:    addi    r9,r9,80
   0x000007e12cbf197c <+52>:    std     r9,56(r31)

121         opal_counted_pointer_t head, tail;
122
123         opal_read_counted_pointer (&fifo->opal_fifo_head, &head);
   0x000007e12cbf1980 <+56>:    ld      r9,40(r31)
   0x000007e12cbf1984 <+60>:    addi    r9,r9,48
   0x000007e12cbf1988 <+64>:    std     r9,80(r31)
   0x000007e12cbf198c <+68>:    addi    r9,r31,112
   0x000007e12cbf1990 <+72>:    std     r9,88(r31)

../../opal/class/opal_lifo.h:
81          value->data.counter = addr->data.counter;
   0x000007e12cbf1994 <+76>:    ld      r9,80(r31)
   0x000007e12cbf1998 <+80>:    ld      r10,0(r9)
   0x000007e12cbf199c <+84>:    ld      r9,88(r31)
   0x000007e12cbf19a0 <+88>:    std     r10,0(r9)

82          opal_atomic_rmb ();
   0x000007e12cbf19a4 <+92>:    bl      0x7e12cbf115c <opal_atomic_rmb>

83          value->data.item = addr->data.item;
   0x000007e12cbf19a8 <+96>:    ld      r9,80(r31)
   0x000007e12cbf19ac <+100>:   ld      r10,8(r9)
   0x000007e12cbf19b0 <+104>:   ld      r9,88(r31)
   0x000007e12cbf19b4 <+108>:   std     r10,8(r9)

../../opal/class/opal_fifo.h:
126             tail.value = fifo->opal_fifo_tail.value;
   0x000007e12cbf19b8 <+112>:   ld      r9,40(r31)
   0x000007e12cbf19bc <+116>:   addi    r9,r9,64
   0x000007e12cbf19c0 <+120>:   lxvd2x  vs0,0,r9
   0x000007e12cbf19c4 <+124>:   xxswapd vs12,vs0
   0x000007e12cbf19c8 <+128>:   addi    r9,r31,128
   0x000007e12cbf19cc <+132>:   xxswapd vs0,vs12
   0x000007e12cbf19d0 <+136>:   stxvd2x vs0,0,r9

127             opal_atomic_rmb ();
   0x000007e12cbf19d4 <+140>:   bl      0x7e12cbf115c <opal_atomic_rmb>

128
129             item = (opal_list_item_t *) head.data.item;
   0x000007e12cbf19d8 <+144>:   ld      r9,120(r31)
   0x000007e12cbf19dc <+148>:   std     r9,64(r31)

130             next = (opal_list_item_t *) item->opal_list_next;
   0x000007e12cbf19e0 <+152>:   ld      r9,64(r31)
   0x000007e12cbf19e4 <+156>:   ld      r9,40(r9)
   0x000007e12cbf19e8 <+160>:   std     r9,72(r31)

131
132             if (ghost == tail.data.item && ghost == item) {
   0x000007e12cbf19ec <+164>:   ld      r9,136(r31)
   0x000007e12cbf19f0 <+168>:   ld      r10,56(r31)
   0x000007e12cbf19f4 <+172>:   cmpd    cr7,r10,r9
   0x000007e12cbf19f8 <+176>:   bne     cr7,0x7e12cbf1a14 <opal_fifo_pop_atomic+204>
   0x000007e12cbf19fc <+180>:   ld      r10,56(r31)
   0x000007e12cbf1a00 <+184>:   ld      r9,64(r31)
   0x000007e12cbf1a04 <+188>:   cmpd    cr7,r10,r9
   0x000007e12cbf1a08 <+192>:   bne     cr7,0x7e12cbf1a14 <opal_fifo_pop_atomic+204>

133                 return NULL;
   0x000007e12cbf1a0c <+196>:   li      r9,0
   0x000007e12cbf1a10 <+200>:   b       0x7e12cbf1b38 <opal_fifo_pop_atomic+496>

134             }
135
136             /* the head or next pointer are in an inconsistent state. keep looping. */
137             if (tail.data.item != item && ghost != tail.data.item && ghost == next) {
   0x000007e12cbf1a14 <+204>:   ld      r9,136(r31)
   0x000007e12cbf1a18 <+208>:   ld      r10,64(r31)
   0x000007e12cbf1a1c <+212>:   cmpd    cr7,r10,r9
   0x000007e12cbf1a20 <+216>:   beq     cr7,0x7e12cbf1a80 <opal_fifo_pop_atomic+312>
   0x000007e12cbf1a24 <+220>:   ld      r9,136(r31)
   0x000007e12cbf1a28 <+224>:   ld      r10,56(r31)
   0x000007e12cbf1a2c <+228>:   cmpd    cr7,r10,r9
   0x000007e12cbf1a30 <+232>:   beq     cr7,0x7e12cbf1a80 <opal_fifo_pop_atomic+312>
   0x000007e12cbf1a34 <+236>:   ld      r10,56(r31)
   0x000007e12cbf1a38 <+240>:   ld      r9,72(r31)
   0x000007e12cbf1a3c <+244>:   cmpd    cr7,r10,r9
   0x000007e12cbf1a40 <+248>:   bne     cr7,0x7e12cbf1a80 <opal_fifo_pop_atomic+312>

138                 opal_read_counted_pointer (&fifo->opal_fifo_head, &head);
   0x000007e12cbf1a44 <+252>:   ld      r9,40(r31)
   0x000007e12cbf1a48 <+256>:   addi    r9,r9,48
   0x000007e12cbf1a4c <+260>:   std     r9,96(r31)
   0x000007e12cbf1a50 <+264>:   addi    r9,r31,112
   0x000007e12cbf1a54 <+268>:   std     r9,104(r31)

../../opal/class/opal_lifo.h:
81          value->data.counter = addr->data.counter;
   0x000007e12cbf1a58 <+272>:   ld      r9,96(r31)
   0x000007e12cbf1a5c <+276>:   ld      r10,0(r9)
   0x000007e12cbf1a60 <+280>:   ld      r9,104(r31)
   0x000007e12cbf1a64 <+284>:   std     r10,0(r9)

82          opal_atomic_rmb ();
   0x000007e12cbf1a68 <+288>:   bl      0x7e12cbf115c <opal_atomic_rmb>

83          value->data.item = addr->data.item;
   0x000007e12cbf1a6c <+292>:   ld      r9,96(r31)
   0x000007e12cbf1a70 <+296>:   ld      r10,8(r9)
   0x000007e12cbf1a74 <+300>:   ld      r9,104(r31)
   0x000007e12cbf1a78 <+304>:   std     r10,8(r9)

../../opal/class/opal_fifo.h:
139                 continue;
   0x000007e12cbf1a7c <+308>:   b       0x7e12cbf1aa8 <opal_fifo_pop_atomic+352>

140             }
141
142             /* try popping the head */
143             if (opal_update_counted_pointer (&fifo->opal_fifo_head, &head, next)) {
   0x000007e12cbf1a80 <+312>:   ld      r9,40(r31)
   0x000007e12cbf1a84 <+316>:   addi    r9,r9,48
   0x000007e12cbf1a88 <+320>:   addi    r10,r31,112
   0x000007e12cbf1a8c <+324>:   ld      r5,72(r31)
   0x000007e12cbf1a90 <+328>:   mr      r4,r10
   0x000007e12cbf1a94 <+332>:   mr      r3,r9
   0x000007e12cbf1a98 <+336>:   bl      0x7e12cbf1710 <opal_update_counted_pointer+8>
   0x000007e12cbf1a9c <+340>:   mr      r9,r3
   0x000007e12cbf1aa0 <+344>:   cmpdi   cr7,r9,0
   0x000007e12cbf1aa4 <+348>:   bne     cr7,0x7e12cbf1aac <opal_fifo_pop_atomic+356>

126             tail.value = fifo->opal_fifo_tail.value;
   0x000007e12cbf1aa8 <+352>:   b       0x7e12cbf19b8 <opal_fifo_pop_atomic+112>

144                 break;
   0x000007e12cbf1aac <+356>:   nop

145             }
146         } while (1);
147
148         opal_atomic_wmb ();
   0x000007e12cbf1ab0 <+360>:   bl      0x7e12cbf1188 <opal_atomic_wmb>

149
150         /* check for tail and head consistency */
151         if (ghost == next) {
   0x000007e12cbf1ab4 <+364>:   ld      r10,56(r31)
   0x000007e12cbf1ab8 <+368>:   ld      r9,72(r31)
   0x000007e12cbf1abc <+372>:   cmpd    cr7,r10,r9
   0x000007e12cbf1ac0 <+376>:   bne     cr7,0x7e12cbf1b28 <opal_fifo_pop_atomic+480>

152             /* the head was just set to &fifo->opal_fifo_ghost. try to update the tail as well */
153             if (!opal_update_counted_pointer (&fifo->opal_fifo_tail, &tail, ghost)) {
   0x000007e12cbf1ac4 <+380>:   ld      r9,40(r31)
   0x000007e12cbf1ac8 <+384>:   addi    r9,r9,64
   0x000007e12cbf1acc <+388>:   addi    r10,r31,128
   0x000007e12cbf1ad0 <+392>:   ld      r5,56(r31)
   0x000007e12cbf1ad4 <+396>:   mr      r4,r10
   0x000007e12cbf1ad8 <+400>:   mr      r3,r9
   0x000007e12cbf1adc <+404>:   bl      0x7e12cbf1710 <opal_update_counted_pointer+8>
   0x000007e12cbf1ae0 <+408>:   mr      r9,r3
   0x000007e12cbf1ae4 <+412>:   xori    r9,r9,1
   0x000007e12cbf1ae8 <+416>:   clrlwi  r9,r9,24
   0x000007e12cbf1aec <+420>:   cmpdi   cr7,r9,0
   0x000007e12cbf1af0 <+424>:   beq     cr7,0x7e12cbf1b28 <opal_fifo_pop_atomic+480>

154                 /* tail was changed by a push operation. wait for the item's next pointer to be se then
155                  * update the head */
156
157                 /* wait for next pointer to be updated by push */
158                 while (ghost == item->opal_list_next) {
   0x000007e12cbf1af4 <+428>:   b       0x7e12cbf1afc <opal_fifo_pop_atomic+436>

159                     opal_atomic_rmb ();
   0x000007e12cbf1af8 <+432>:   bl      0x7e12cbf115c <opal_atomic_rmb>

158                 while (ghost == item->opal_list_next) {
   0x000007e12cbf1afc <+436>:   ld      r9,64(r31)
   0x000007e12cbf1b00 <+440>:   ld      r9,40(r9)
   0x000007e12cbf1b04 <+444>:   ld      r10,56(r31)
   0x000007e12cbf1b08 <+448>:   cmpd    cr7,r10,r9
   0x000007e12cbf1b0c <+452>:   beq     cr7,0x7e12cbf1af8 <opal_fifo_pop_atomic+432>

160                 }
161
162                 opal_atomic_rmb ();
   0x000007e12cbf1b10 <+456>:   bl      0x7e12cbf115c <opal_atomic_rmb>

163
164                 /* update the head with the real next value. note that no other thread
165                  * will be attempting to update the head until after it has been updated
166                  * with the next pointer. push will not see an empty list and other pop
167                  * operations will loop until the head is consistent. */
168                 fifo->opal_fifo_head.data.item = (opal_list_item_t *) item->opal_list_next;
   0x000007e12cbf1b14 <+460>:   ld      r9,64(r31)
   0x000007e12cbf1b18 <+464>:   ld      r10,40(r9)
   0x000007e12cbf1b1c <+468>:   ld      r9,40(r31)
   0x000007e12cbf1b20 <+472>:   std     r10,56(r9)

169                 opal_atomic_wmb ();
   0x000007e12cbf1b24 <+476>:   bl      0x7e12cbf1188 <opal_atomic_wmb>

170             }
171         }
172
173         item->opal_list_next = NULL;
   0x000007e12cbf1b28 <+480>:   ld      r9,64(r31)
   0x000007e12cbf1b2c <+484>:   li      r10,0
   0x000007e12cbf1b30 <+488>:   std     r10,40(r9)

174
175         return item;
   0x000007e12cbf1b34 <+492>:   ld      r9,64(r31)

176     }
   0x000007e12cbf1b38 <+496>:   mr      r3,r9
   0x000007e12cbf1b3c <+500>:   ld      r9,152(r31)
   0x000007e12cbf1b40 <+504>:   ld      r10,-28688(r13)
   0x000007e12cbf1b44 <+508>:   cmpld   cr7,r9,r10
   0x000007e12cbf1b48 <+512>:   li      r9,0
   0x000007e12cbf1b4c <+516>:   li      r10,0
   0x000007e12cbf1b50 <+520>:   beq     cr7,0x7e12cbf1b5c <opal_fifo_pop_atomic+532>
   0x000007e12cbf1b54 <+524>:   bl      0x7e12cbf0f00 <00000018.plt_call.__stack_chk_fail@@GLIBC_2.17>
   0x000007e12cbf1b58 <+528>:   ld      r2,24(r1)
   0x000007e12cbf1b5c <+532>:   addi    r1,r31,176
   0x000007e12cbf1b60 <+536>:   ld      r0,16(r1)
   0x000007e12cbf1b64 <+540>:   mtlr    r0
   0x000007e12cbf1b68 <+544>:   ld      r31,-8(r1)
   0x000007e12cbf1b6c <+548>:   blr
   0x000007e12cbf1b70 <+552>:   .long 0x0
   0x000007e12cbf1b74 <+556>:   .long 0x1000000
   0x000007e12cbf1b78 <+560>:   .long 0x1000180
End of assembler dump.

Note: gdb threads 2 and 4-9 are all caught in the retry loop of opal_fifo_pop_atomic (I had gdb display some of the noteworthy variable values):

(gdb)
(gdb) step
126             tail.value = fifo->opal_fifo_tail.value;
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
127             opal_atomic_rmb ();
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:60
60          __atomic_thread_fence (__ATOMIC_ACQUIRE);
(gdb)
61      }
(gdb)
opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
129             item = (opal_list_item_t *) head.data.item;
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
130             next = (opal_list_item_t *) item->opal_list_next;
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
132             if (ghost == tail.data.item && ghost == item) {
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
137             if (tail.data.item != item && ghost != tail.data.item && ghost == next) {
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
138                 opal_read_counted_pointer (&fifo->opal_fifo_head, &head);
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
opal_read_counted_pointer (value=0x796661f7e590, addr=0x7fffeeb23f70) at ../../opal/class/opal_lifo.h:81
81          value->data.counter = addr->data.counter;
11: addr->data.counter = 13038893
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
24: value->data.counter = 13038893
25: addr->data.counter = 13038893
26: value->data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
27: addr->data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
(gdb)
82          opal_atomic_rmb ();
11: addr->data.counter = 13038893
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
24: value->data.counter = 13038893
25: addr->data.counter = 13038893
26: value->data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
27: addr->data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
(gdb)
opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:60
60          __atomic_thread_fence (__ATOMIC_ACQUIRE);
(gdb)
61      }
(gdb)
opal_read_counted_pointer (value=0x796661f7e590, addr=0x7fffeeb23f70) at ../../opal/class/opal_lifo.h:83
83          value->data.item = addr->data.item;
11: addr->data.counter = 13038893
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
24: value->data.counter = 13038893
25: addr->data.counter = 13038893
26: value->data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
27: addr->data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
(gdb)
opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:139
139                 continue;
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
126             tail.value = fifo->opal_fifo_tail.value;
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7fffeeb23f90
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7fffeeb23f90
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
(gdb)
(gdb) info thread
  Id   Target Id         Frame
  1    Thread 0x796665cb55c0 (LWP 27439) "opal_fifo" 0x0000796665aa9db4 in __pthread_join (threadid=133480637264256, thread_return=0x7fffeeb23ed8)
    at pthread_join.c:90
  2    Thread 0x796661f7f180 (LWP 27466) "opal_fifo" 0x000007e12cbf19f0 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40)
    at ../../opal/class/opal_fifo.h:132
* 3    Thread 0x79666277f180 (LWP 27467) "opal_fifo" opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:158
  4    Thread 0x796662f7f180 (LWP 27468) "opal_fifo" 0x000007e12cbf1a40 in opal_fifo_pop_atomic (fifo=0x7fffeeb23f40)
    at ../../opal/class/opal_fifo.h:137
  5    Thread 0x79666377f180 (LWP 27469) "opal_fifo" opal_read_counted_pointer (value=0x79666377e590, addr=0x7fffeeb23f70)
    at ../../opal/class/opal_lifo.h:83
  6    Thread 0x796663f7f180 (LWP 27470) "opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
  7    Thread 0x79666577f180 (LWP 27471) "opal_fifo" opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:129
  8    Thread 0x796664f7f180 (LWP 27472) "opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
  9    Thread 0x79666477f180 (LWP 27473) "opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61

Note: Thread 3 is looping a bit farther down, at opal_fifo.h:158:

(gdb) thread 3
opal_fifo_pop_atomic (fifo=0x7fffeeb23f40) at ../../opal/class/opal_fifo.h:158
158                 while (ghost == item->opal_list_next) {
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7e168eb8610
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038892, item = 0x7e168eb8610}, value = 0x000007e168eb86100000000000c6f52c}
28: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
(gdb) step
159                     opal_atomic_rmb ();
14: tail.value = 0x000007e168eb86100000000000c6f4c9
15: fifo->opal_fifo_tail.value = 0x000007e168eb86100000000000c6f4c9
16: item = (opal_list_item_t *) 0x7e168eb8610
17: head.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
18: next = (opal_list_item_t *) 0x7fffeeb23f90
19: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
20: ghost = (opal_list_item_t *) 0x7fffeeb23f90
21: tail.data.item = (volatile opal_list_item_t * volatile) 0x7e168eb8610
22: fifo->opal_fifo_head = {data = {counter = 13038893, item = 0x7fffeeb23f90}, value = 0x00007fffeeb23f900000000000c6f52d}
23: head = {data = {counter = 13038892, item = 0x7e168eb8610}, value = 0x000007e168eb86100000000000c6f52c}
28: item->opal_list_next = (volatile struct opal_list_item_t * volatile) 0x7fffeeb23f90
(gdb) step
opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:60
60          __atomic_thread_fence (__ATOMIC_ACQUIRE);
(gdb) step
61      }

Is there any additional information that I could collect that would be helpful to diagnose the issue?

gpaulsen commented 6 years ago

Verifying.

gpaulsen commented 6 years ago

Verified. I hit the hang in iteration 4.

(gdb) info thread
  Id   Target Id         Frame
  9    Thread 0x10000265f1b0 (LWP 90430) "lt-opal_fifo" opal_atomic_rmb ()
    at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
  8    Thread 0x10000224f1b0 (LWP 90431) "lt-opal_fifo" 0x00000000100017e0 in opal_fifo_pop_atomic (
    fifo=0x3fffff302510) at ../../opal/class/opal_fifo.h:137
  7    Thread 0x100001e3f1b0 (LWP 90432) "lt-opal_fifo" 0x00000000100017e8 in opal_fifo_pop_atomic (
    fifo=0x3fffff302510) at ../../opal/class/opal_fifo.h:137
  6    Thread 0x100001a2f1b0 (LWP 90433) "lt-opal_fifo" opal_read_counted_pointer (value=0x100001a2e5e0,
    addr=0x3fffff302540) at ../../opal/class/opal_lifo.h:83
  5    Thread 0x10000161f1b0 (LWP 90434) "lt-opal_fifo" opal_atomic_rmb ()
    at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
  4    Thread 0x10000120f1b0 (LWP 90435) "lt-opal_fifo" opal_fifo_pop_atomic (fifo=0x3fffff302510)
    at ../../opal/class/opal_fifo.h:138
  3    Thread 0x100000dff1b0 (LWP 90436) "lt-opal_fifo" 0x0000000010001800 in opal_fifo_pop_atomic (
    fifo=0x3fffff302510) at ../../opal/class/opal_fifo.h:138
  2    Thread 0x1000009ef1b0 (LWP 90437) "lt-opal_fifo" 0x00000000100017ec in opal_fifo_pop_atomic (
    fifo=0x3fffff302510) at ../../opal/class/opal_fifo.h:137
* 1    Thread 0x100000045570 (LWP 90404) "lt-opal_fifo" 0x00001000003ba0d8 in pthread_join ()
   from /lib64/libpthread.so.0

Tomorrow I'll debug.

gpaulsen commented 6 years ago

I've verified --disable-builtin-atomics works around the problem on power8.
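For anyone who wants to try the same workaround, a minimal sketch follows. It assumes the configure options quoted elsewhere in this issue and only appends --disable-builtin-atomics; the variable name CONFIGURE_ARGS is illustrative and not part of Open MPI.

```shell
#!/bin/bash
# Sketch: the configure options from this report plus the workaround flag.
# CONFIGURE_ARGS is a made-up name used only for this example.
CONFIGURE_ARGS="--enable-debug --prefix=/usr --mandir=/usr/share/man \
--sysconfdir=/etc/openmpi --enable-ipv6 --with-threads=posix \
--with-hwloc=/usr --disable-builtin-atomics"

# Print the command rather than running it, so it can be reviewed first.
# To actually reconfigure, run the printed line from the top of the source
# tree, then rebuild with "make" and "make check".
echo ./configure $CONFIGURE_ARGS
```

After rebuilding, rerunning the stress loop from the original report should show whether the hang still reproduces.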

mksully22 commented 6 years ago

Do you know if you can hit the problem on Power9?

gpaulsen commented 6 years ago

I'll check. I hope that https://github.com/open-mpi/ompi/pull/5374 resolved this on Power9, but I didn't try the stress test.

mksully22 commented 6 years ago

I'll check on my Power9 system as well

gpaulsen commented 6 years ago

Power9 has run over 140 iterations successfully. I did not disable the builtin atomics when testing on Power9.

So it looks like this is just a Power8 issue. By default, configure on Power8 detects the following:

   checking for __atomic builtin atomics... yes
   checking for processor support of __atomic builtin atomic compare-and-swap on 128-bit values... yes
   checking if __int128 atomic compare-and-swap is always lock-free... yes
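To compare what configure detected on different machines, the atomics-related checks can be pulled out of config.log after the fact. The helper below is a rough sketch: the function name atomics_checks is hypothetical, and it assumes the usual Autoconf log layout of "configure:NNN: checking ..." lines followed by "configure:NNN: result: ..." lines.

```shell
#!/bin/bash
# Hypothetical helper (not part of Open MPI): list every configure check
# whose description mentions "atomic", together with its result.
# Usage: atomics_checks [path/to/config.log]
atomics_checks() {
    awk '
        # Remember the most recent "checking ..." line, minus the prefix.
        /: checking / { sub(/^configure:[0-9]+: /, ""); check = $0; next }
        # On a result line, print the pair if the check was about atomics.
        /: result: /  { sub(/^configure:[0-9]+: result: /, "")
                        if (tolower(check) ~ /atomic/) print check ": " $0
                        check = "" }
    ' "${1:-config.log}"
}

# Example (commented out so sourcing this file has no side effects):
# atomics_checks config.log
```

Nested checks, where one check starts before the previous one reports a result, are not handled by this sketch.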

mksully22 commented 6 years ago

I didn't have as much luck on my Power9 system. Configured with

./configure --enable-debug --prefix=/usr --mandir=/usr/share/man --sysconfdir=/etc/openmpi --enable-ipv6 --with-threads=posix --with-hwloc=/usr

I still get hangs. The gdb output is below. Note: I repeated the stress test with --disable-builtin-atomics added to the configure options and could not reproduce the problem.

/tmp # gdb attach 107
GNU gdb (GDB) 8.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-alpine-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
attach: No such file or directory.
Attaching to process 107
[New LWP 129]
[New LWP 130]
[New LWP 131]
[New LWP 132]
[New LWP 133]
[New LWP 134]
[New LWP 135]
[New LWP 136]
0x00007cd545b21fe0 in __clone () from /lib/ld-musl-powerpc64le.so.1
(gdb) info threads
  Id   Target Id         Frame
* 1    LWP 107 "lt-opal_fifo" 0x00007cd545b21fe0 in __clone () from /lib/ld-musl-powerpc64le.so.1
  2    LWP 129 "lt-opal_fifo" 0x00000f5f7aa11acc in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:137
  3    LWP 130 "lt-opal_fifo" 0x00000f5f7aa11a74 in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:130
  4    LWP 131 "lt-opal_fifo" opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:130
  5    LWP 132 "lt-opal_fifo" 0x00000f5f7aa11adc in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:138
  6    LWP 133 "lt-opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
  7    LWP 134 "lt-opal_fifo" 0x00000f5f7aa11a70 in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:130
  8    LWP 135 "lt-opal_fifo" 0x00000f5f7aa11a5c in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:126
  9    LWP 136 "lt-opal_fifo" opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
(gdb) thread apply all bt

Thread 9 (LWP 136):
#0  opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
#1  0x00000f5f7aa11b88 in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:159
#2  0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#3  0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#4  0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1

Thread 8 (LWP 135):
#0  0x00000f5f7aa11a5c in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:126
#1  0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#2  0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3  0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1

Thread 7 (LWP 134):
#0  0x00000f5f7aa11a70 in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:130
#1  0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#2  0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3  0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1

Thread 6 (LWP 133):
#0  opal_atomic_rmb () at ../../opal/include/opal/sys/gcc_builtin/atomic.h:61
#1  0x00000f5f7aa11a64 in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:127
#2  0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#3  0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#4  0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1

Thread 5 (LWP 132):
#0  0x00000f5f7aa11adc in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:138
#1  0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#2  0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3  0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1

Thread 4 (LWP 131):
#0  opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:130
#1  0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#2  0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3  0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1

Thread 3 (LWP 130):
#0  0x00000f5f7aa11a74 in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:130
#1  0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#2  0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3  0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1

Thread 2 (LWP 129):
#0  0x00000f5f7aa11acc in opal_fifo_pop_atomic (fifo=0x7ffffae2d840) at ../../opal/class/opal_fifo.h:137
#1  0x00000f5f7aa11fb4 in thread_test_exhaust (arg=0x7ffffae2d840) at opal_fifo.c:80
#2  0x00007cd545b148f0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3  0x00007cd545b21fac in __clone () from /lib/ld-musl-powerpc64le.so.1

Thread 1 (LWP 107):
#0  0x00007cd545b21fe0 in __clone () from /lib/ld-musl-powerpc64le.so.1
#1  0x00007cd545b138c0 in ?? () from /lib/ld-musl-powerpc64le.so.1
#2  0x00007cd545b12374 in ?? () from /lib/ld-musl-powerpc64le.so.1
#3  0x00007cd545b12478 in __timedwait_cp () from /lib/ld-musl-powerpc64le.so.1
Backtrace stopped: frame did not save the PC
(gdb) detach
Detaching from program: /tmp/ompi/test/class/.libs/lt-opal_fifo, process 107
(gdb) quit
/tmp # cat /proc/cpuinfo
processor       : 0
cpu             : POWER9, altivec supported
clock           : 2300.000000MHz
revision        : 2.2 (pvr 004e 1202)

processor       : 1
cpu             : POWER9, altivec supported
clock           : 2300.000000MHz
revision        : 2.2 (pvr 004e 1202)

processor       : 2
cpu             : POWER9, altivec supported
clock           : 2300.000000MHz
revision        : 2.2 (pvr 004e 1202)

processor       : 3
cpu             : POWER9, altivec supported
clock           : 2300.000000MHz
revision        : 2.2 (pvr 004e 1202)

processor       : 4
cpu             : POWER9, altivec supported
clock           : 2300.000000MHz
revision        : 2.2 (pvr 004e 1202)
===trim===

gpaulsen commented 6 years ago

So your Power9 was hanging too? My Power9 that seemed to work was running RHEL 7.5 with gcc 4.8.5. It's also a rev 2.2.

mksully22 commented 6 years ago

Yes, Power9 hung too. I tried it on Ubuntu 18.04 with gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3).

I also tried it in an Alpine container hosted on Ubuntu 18.04 and experienced the hang there too.
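Since the hang is intermittent and shows up on several systems, a stress loop that flags a hung iteration on its own can make comparisons easier. This is a sketch only: run_with_limit and stress are hypothetical helper names, the 60-second limit in the example is an arbitrary assumption, and it relies on coreutils timeout, which exits with status 124 when the limit is reached.

```shell
#!/bin/bash
# Hypothetical helper: run a command with a time limit and classify the result.
# Usage: run_with_limit SECONDS COMMAND [ARGS...]
run_with_limit() {
    local limit=$1 status=0
    shift
    # "|| status=$?" keeps this safe even if the caller uses "set -e".
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 124 ]; then
        echo "hang"      # time limit reached, process was killed
    elif [ "$status" -eq 0 ]; then
        echo "ok"
    else
        echo "fail"      # finished, but with a non-zero exit status
    fi
}

# Repeat the command until it hangs or fails, or until MAX iterations pass.
# Usage: stress SECONDS MAX COMMAND [ARGS...]
stress() {
    local limit=$1 max=$2 i=0 result
    shift 2
    while [ "$i" -lt "$max" ]; do
        result=$(run_with_limit "$limit" "$@")
        echo "Iteration $i: $result"
        [ "$result" = "ok" ] || return 1
        i=$((i + 1))
    done
}

# Example (commented out so sourcing this file has no side effects):
# stress 60 1000 ./test/class/opal_fifo
```

Note that timeout kills the hung process, so this variant is for counting hangs. To attach gdb to a live hang, the plain loop from the original report is still the better fit.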

jsquyres commented 6 years ago

@mksully22 GitHub pro tip: use a line of 3 backticks to start and end verbatim sections so the text renders better (see https://help.github.com/articles/creating-and-highlighting-code-blocks/).

jsquyres commented 6 years ago

@gpaulsen Any progress?

jsquyres commented 6 years ago

On the Webex today, @hjelmn says he'll go have a look at this. One possibility -- and @hjelmn will need to check with @bosilca -- is to disable the builtin atomics on v4.0.0.

bosilca commented 6 years ago

Once we pull #5445 we should be able to even drop builtin atomics completely.

gpaulsen commented 4 years ago

Removing critical since there is a workaround: disabling the builtin atomics. I'll rerun the test on v4.0.x and see where it stands today.

gpaulsen commented 4 years ago

@bosilca https://github.com/open-mpi/ompi/pull/5445 has been merged. Should we open an issue to track removing builtin atomics on master before v5.0.x?

awlauria commented 3 years ago

This should be resolved by dropping back to the atomic assembly on Power machines in these PRs to v4.1.x and v5.0.x:

v4.1.x: #8708
v5.0.x: #8710
master: https://github.com/open-mpi/ompi/pull/8649 (has some performance numbers and discussion)

It looks like the fix is not going back to v4.0.x. However, users can compile with the configure option --disable-builtin-atomics to work around this on v4.0.x.

See issue:

https://github.com/open-mpi/ompi/issues/2966