open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

orted crash with v4.0.5 and BUILTIN_GCC (GCC v7.3.1) #8268


rajachan commented 3 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

RPM built from the internal spec file.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

N/A

Please describe the system on which you are running


Details of the problem

We are seeing occasional orted segfaults on this platform when running with a default configuration:

==== starting mpirun --prefix /opt/amazon/openmpi --wdir results/omb/collective/osu_allgatherv -n 2048 -N 64 --tag-output  --hostfile /fsx/hfile -x PATH -x LD_LIBRARY_PATH /fsx/dkothar/SubspaceBenchmarks/spack/opt/spack/linux-amzn2-aarch64/gcc-7.3.1/osu-micro-benchmarks-5.6-xmuoliterjpnfcnhn2wpapdpdisfrmrx/libexec/osu-micro-benchmarks/mpi/collective/osu_allgatherv -x 10 -i 10 : Mon Nov 30 14:52:04 UTC 2020 ====
[ip-172-31-15-226:13802] *** Process received signal ***
[ip-172-31-15-226:13802] Signal: Segmentation fault (11)
[ip-172-31-15-226:13802] Signal code: Address not mapped (1)
[ip-172-31-15-226:13802] Failing at address: (nil)
[ip-172-31-15-226:13802] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x4000202a5668]
[ip-172-31-15-226:13802] [ 1] /opt/amazon/openmpi/lib64/libopen-rte.so.40(orte_state_base_activate_proc_state+0xcc)[0x40002034f710]
[ip-172-31-15-226:13802] [ 2] /opt/amazon/openmpi/lib64/libopen-rte.so.40(orte_odls_base_spawn_proc+0x4fc)[0x40002032397c]
[ip-172-31-15-226:13802] [ 3] /opt/amazon/openmpi/lib64/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xdb0)[0x40002041eed0]
[ip-172-31-15-226:13802] [ 4] /opt/amazon/openmpi/lib64/libopen-pal.so.40(+0x3e038)[0x4000203da038]
[ip-172-31-15-226:13802] [ 5] /lib64/libpthread.so.0(+0x72ac)[0x4000206562ac]
[ip-172-31-15-226:13802] [ 6] /lib64/libc.so.6(+0xd5e9c)[0x400020759e9c]
[ip-172-31-15-226:13802] *** End of error message ***
bash: line 1: 13802 Segmentation fault      (core dumped) /opt/amazon/openmpi/bin/orted -mca ess "env" -mca ess_base_jobid "2449276928" -mca ess_base_vpid 31 -mca ess_base_num_procs "32" -mca orte_node_regex "ip-[3:172]-31-4-217,[3:172].31.14.206,[3:172].31.10.112,[3:172].31.1.198,[3:172].31.8.200,[3:172].31.14.151,[3:172].31.7.206,[3:172].31.7.136,[3:172].31.2.252,[3:172].31.9.88,[3:172].31.13.44,[3:172].31.1.14,[3:172].31.9.249,[3:172].31.0.146,[3:172].31.3.111,[3:172].31.4.58,[3:172].31.12.94,[3:172].31.4.81,[3:172].31.4.249,[3:172].31.6.39,[3:172].31.7.103,[3:172].31.11.148,[3:172].31.0.23,[3:172].31.0.165,[3:172].31.5.196,[3:172].31.10.50,[3:172].31.11.232,[3:172].31.2.153,[3:172].31.3.106,[3:172].31.10.135,[3:172].31.14.82,[1:3].235.16.187@0(32)" -mca orte_hnp_uri "2449276928.0;tcp://172.31.4.217:37933" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "2449276928.0;tcp://172.31.4.217:37933" -mca rmaps_ppr_n_pernode "64" -mca orte_tag_output "1" -mca pmix "^s1,s2,cray,isolated"
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[37373,0],0] on node ip-172-31-4-217
  Remote daemon: [[37373,0],31] on node 3.235.16.187

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
return status: 205

It looks like cls_constructor_array is NULL inside the OPAL thread, even though opal_cls_initialize() should have initialized it. The initialization appears to be protected by an atomic lock. Suspecting the atomic implementation on this platform, we disabled the builtins (--disable-builtin-atomics) and the issue no longer occurs. By default on the 4.0.x branch we use BUILTIN_GCC atomics (from GCC v7.3.1 in this case); with the builtins disabled, the arm64-specific assembly is used for atomic ops instead. The issue also does not occur on distros that ship a newer GCC (> v9, e.g. Ubuntu 20).

This failure has been non-deterministic and hard to reproduce consistently. Are there any known issues in this code path? I've gone through open issues related to atomics and Arm and could not find anything in particular that might be causing this. Wanted to put feelers out while we continue to debug.

cc: @hjelmn

rhc54 commented 3 years ago

I'm unaware of any problems down in there, but that doesn't mean something couldn't exist. The issue in the opal thread sounds very suspicious, however, especially if it works using different atomics.