open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

orted crash with v4.0.5 and BUILTIN_GCC (GCC v7.3.1) #8268


rajachan commented 3 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

RPM built from the internal spec file.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

N/A

Please describe the system on which you are running


Details of the problem

We are seeing occasional orted segfaults on this platform when running with a default configuration:

==== starting mpirun --prefix /opt/amazon/openmpi --wdir results/omb/collective/osu_allgatherv -n 2048 -N 64 --tag-output  --hostfile /fsx/hfile -x PATH -x LD_LIBRARY_PATH /fsx/dkothar/SubspaceBenchmarks/spack/opt/spack/linux-amzn2-aarch64/gcc-7.3.1/osu-micro-benchmarks-5.6-xmuoliterjpnfcnhn2wpapdpdisfrmrx/libexec/osu-micro-benchmarks/mpi/collective/osu_allgatherv -x 10 -i 10 : Mon Nov 30 14:52:04 UTC 2020 ====
[ip-172-31-15-226:13802] *** Process received signal ***
[ip-172-31-15-226:13802] Signal: Segmentation fault (11)
[ip-172-31-15-226:13802] Signal code: Address not mapped (1)
[ip-172-31-15-226:13802] Failing at address: (nil)
[ip-172-31-15-226:13802] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x4000202a5668]
[ip-172-31-15-226:13802] [ 1] /opt/amazon/openmpi/lib64/libopen-rte.so.40(orte_state_base_activate_proc_state+0xcc)[0x40002034f710]
[ip-172-31-15-226:13802] [ 2] /opt/amazon/openmpi/lib64/libopen-rte.so.40(orte_odls_base_spawn_proc+0x4fc)[0x40002032397c]
[ip-172-31-15-226:13802] [ 3] /opt/amazon/openmpi/lib64/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xdb0)[0x40002041eed0]
[ip-172-31-15-226:13802] [ 4] /opt/amazon/openmpi/lib64/libopen-pal.so.40(+0x3e038)[0x4000203da038]
[ip-172-31-15-226:13802] [ 5] /lib64/libpthread.so.0(+0x72ac)[0x4000206562ac]
[ip-172-31-15-226:13802] [ 6] /lib64/libc.so.6(+0xd5e9c)[0x400020759e9c]
[ip-172-31-15-226:13802] *** End of error message ***
bash: line 1: 13802 Segmentation fault      (core dumped) /opt/amazon/openmpi/bin/orted -mca ess "env" -mca ess_base_jobid "2449276928" -mca ess_base_vpid 31 -mca ess_base_num_procs "32" -mca orte_node_regex "ip-[3:172]-31-4-217,[3:172].31.14.206,[3:172].31.10.112,[3:172].31.1.198,[3:172].31.8.200,[3:172].31.14.151,[3:172].31.7.206,[3:172].31.7.136,[3:172].31.2.252,[3:172].31.9.88,[3:172].31.13.44,[3:172].31.1.14,[3:172].31.9.249,[3:172].31.0.146,[3:172].31.3.111,[3:172].31.4.58,[3:172].31.12.94,[3:172].31.4.81,[3:172].31.4.249,[3:172].31.6.39,[3:172].31.7.103,[3:172].31.11.148,[3:172].31.0.23,[3:172].31.0.165,[3:172].31.5.196,[3:172].31.10.50,[3:172].31.11.232,[3:172].31.2.153,[3:172].31.3.106,[3:172].31.10.135,[3:172].31.14.82,[1:3].235.16.187@0(32)" -mca orte_hnp_uri "2449276928.0;tcp://172.31.4.217:37933" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "2449276928.0;tcp://172.31.4.217:37933" -mca rmaps_ppr_n_pernode "64" -mca orte_tag_output "1" -mca pmix "^s1,s2,cray,isolated"
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[37373,0],0] on node ip-172-31-4-217
  Remote daemon: [[37373,0],31] on node 3.235.16.187

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
return status: 205

It looks like cls_constructor_array is NULL inside the OPAL thread, even though opal_cls_initialize() should have initialized it. The initialization appears to be protected by an atomic lock. Suspecting the atomic implementation on this platform, we disabled the builtins (--disable-builtin-atomics) and the issue no longer occurs. By default on the 4.0.x branch we use BUILTIN_GCC atomics (from GCC v7.3.1 in this case); with the builtins disabled, the arm64-specific assembly is used for atomic ops instead. The issue also does not occur on distros that ship a newer GCC (> v9, e.g. Ubuntu 20).

This failure has been non-deterministic and hard to reproduce consistently. Are there any known issues in this code path? I've gone through open issues related to atomics and Arm and could not find anything in particular that might be causing this. Wanted to put feelers out while we continue to debug.

cc: @hjelmn

rhc54 commented 3 years ago

I'm unaware of any problems down in there, but that doesn't mean something couldn't exist. The issue in the opal thread sounds very suspicious, however, especially if it works using different atomics.