open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Segfault in MPI_Finalize if MPI_Barrier isn't called first. #10313

Status: Open · dschulzg opened this issue 2 years ago

dschulzg commented 2 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Problem happens with v4.1.3 and v4.1.1 (noticed with 4.1.1, then upgraded to 4.1.3 to see if that fixed it).

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source tarball

./configure --prefix=/global/software/openmpi/gnu-8.4.1/4.1.3 --without-tm --with-pmix=internal --with-pmi --with-hwloc=internal --with-ucx --without-psm2 --without-verbs

(I tried both --with and --without for each of verbs, ucx, and psm2.)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

The biggest difference between working nodes and broken ones is the CPUs. Here is a snippet of a diff of the /proc/cpuinfo flags: the flags on the left of the <> are from the slightly older Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz (in Dell 6420s), and the flags in the right column are from the newer Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz (in Dell c6520s; note the slightly different chassis model).

# diff <(ssh mc1 grep flags /proc/cpuinfo |head -n1|tr " " "\n")  <(ssh mc82 grep flags /proc/cpuinfo |head -n1|tr " " "\n") --side-by-side |egrep '<|>'|tr -s "\t"

cdp_l3        <
          > sgx
mpx       <
          > avx512ifma
          > sha_ni
          > split_lock_detect
          > wbnoinvd
          > avx512vbmi
          > umip
          > avx512_vbmi2
          > gfni
          > vaes
          > vpclmulqdq
          > avx512_bitalg
          > tme
          > avx512_vpopcntdq
          > la57
          > rdpid
          > sgx_lc
          > fsrm
          > pconfig

Details of the problem

Running this program gives the following output:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {

  // Initialize the MPI environment. The two arguments to MPI_Init are not
  // currently used by MPI implementations, but are there in case future
  // implementations might need the arguments.
  printf("About to MPI_Init.\n");
  fflush(stdout);
  MPI_Init(NULL, NULL);
  printf("Finished MPI_Init.\n");
  fflush(stdout);
  printf("About to MPI_finalize.\n");
  fflush(stdout);
  // Finalize the MPI environment. No more MPI calls can be made after this
  MPI_Finalize();
  printf("Finished MPI_Finalize.\n");
  fflush(stdout);

  return 0;
}
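
Compiled with the MPI wrapper compiler (the exact command wasn't shown in the thread; this is the usual invocation):

mpicc -o reallysmall reallysmall.c
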
$ salloc -N2 --ntasks-per-node=1 --mem=3G --time=1:00:00 -p cpu2022 --reservation=dstest 
salloc: Granted job allocation 13764546
salloc: Waiting for resource configuration
salloc: Nodes mc[82-83] are ready for job
[user1@mc82 helloworld]$ mpirun ./reallysmall
About to MPI_Init.
About to MPI_Init.
[mc83:1832927] *** Process received signal ***
[mc83:1832927] Signal: Segmentation fault (11)
[mc83:1832927] Signal code: Address not mapped (1)
[mc83:1832927] Failing at address: 0x7fd3fac5e0e0
[mc83:1832927] [ 0] /lib64/libpthread.so.0(+0x12c20)[0x7fd3fa52cc20]
[mc83:1832927] [ 1] /lib64/libibverbs.so.1(+0xc7f5)[0x7fd3edb2b7f5]
[mc83:1832927] [ 2] /lib64/libibverbs.so.1(ibv_create_comp_channel+0x5a)[0x7fd3edb36caa]
[mc83:1832927] [ 3] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_btl_openib.so(+0x291f1)[0x7fd3ee39a1f1]
[mc83:1832927] [ 4] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_btl_openib.so(opal_btl_openib_connect_base_select_for_local_port+0x112)[0x7fd3ee393ec2]
[mc83:1832927] [ 5] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_btl_openib.so(+0x10c4d)[0x7fd3ee381c4d]
[mc83:1832927] [ 6] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libopen-pal.so.40(mca_btl_base_select+0xdd)[0x7fd3f9c0d08d]
[mc83:1832927] [ 7] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7fd3ee7b0fe2]
[mc83:1832927] [ 8] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7fd3fa7d7ed4]
[mc83:1832927] [ 9] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libmpi.so.40(ompi_mpi_init+0x614)[0x7fd3fa837ae4]
[mc83:1832927] [10] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libmpi.so.40(MPI_Init+0x81)[0x7fd3fa7bc681]
[mc83:1832927] [11] ./reallysmall[0x4007ad]
[mc83:1832927] [12] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7fd3fa178493]
[mc83:1832927] [13] ./reallysmall[0x4006be]
[mc83:1832927] *** End of error message ***
[mc82:3778110] *** Process received signal ***
[mc82:3778110] Signal: Segmentation fault (11)
[mc82:3778110] Signal code: Address not mapped (1)
[mc82:3778110] Failing at address: 0x7fd34873e0e0
[mc82:3778110] [ 0] /lib64/libpthread.so.0(+0x12c20)[0x7fd34800cc20]
[mc82:3778110] [ 1] /lib64/libibverbs.so.1(+0xc7f5)[0x7fd3375817f5]
[mc82:3778110] [ 2] /lib64/libibverbs.so.1(ibv_create_comp_channel+0x5a)[0x7fd33758ccaa]
[mc82:3778110] [ 3] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_btl_openib.so(+0x291f1)[0x7fd337df01f1]
[mc82:3778110] [ 4] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_btl_openib.so(opal_btl_openib_connect_base_select_for_local_port+0x112)[0x7fd337de9ec2]
[mc82:3778110] [ 5] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_btl_openib.so(+0x10c4d)[0x7fd337dd7c4d]
[mc82:3778110] [ 6] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libopen-pal.so.40(mca_btl_base_select+0xdd)[0x7fd3476ed08d]
[mc82:3778110] [ 7] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7fd33c342fe2]
[mc82:3778110] [ 8] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7fd3482b7ed4]
[mc82:3778110] [ 9] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libmpi.so.40(ompi_mpi_init+0x614)[0x7fd348317ae4]
[mc82:3778110] [10] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libmpi.so.40(MPI_Init+0x81)[0x7fd34829c681]
[mc82:3778110] [11] ./reallysmall[0x4007ad]
[mc82:3778110] [12] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7fd347c58493]
[mc82:3778110] [13] ./reallysmall[0x4006be]
[mc82:3778110] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node mc82 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I've tried with and without Mellanox OFED installed (vs. Rocky's distributed packages), upgraded UCX from 1.9 to 1.12, and tried the newest glibc and libibverbs from Rocky 8.5's repos.

Also tried new firmware on the IB HCA.

I'm out of things to try upgrading at this point. Any thoughts?

Thanks -Dave

ggouaillardet commented 2 years ago

@dschulzg the logs strongly suggest that UCX is not being used; Open MPI is instead using the legacy btl/openib component.

You can run mpirun --mca pml ucx ... to force Open MPI to use UCX, or abort if it is not available.
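
For example, with the test binary from the transcript above:

mpirun --mca pml ucx ./reallysmall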

dschulzg commented 2 years ago

I forced UCX as you said, but it still segfaults, though only sometimes. I ran the test program in a loop 300 times and it crashed 8 times (the loop is sketched after the trace below). The stack dump looks the same to me:

[mc83:1867714] *** Process received signal ***
[mc83:1867714] Signal: Segmentation fault (11)
[mc83:1867714] Signal code: Address not mapped (1)
[mc83:1867714] Failing at address: 0x7fb210ae70e0
[mc83:1867714] [ 0] /lib64/libpthread.so.0(+0x12c20)[0x7fb2103b5c20]
[mc83:1867714] [ 1] /lib64/libibverbs.so.1(+0xc7f5)[0x7fb1ff9d57f5]
[mc83:1867714] [ 2] /lib64/libibverbs.so.1(ibv_create_comp_channel+0x5a)[0x7fb1ff9e0caa]
[mc83:1867714] [ 3] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_btl_openib.so(+0x291b1)[0x7fb2043451b1]
[mc83:1867714] [ 4] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_btl_openib.so(opal_btl_openib_connect_base_select_for_local_port+0x112)[0x7fb20433ee82]
[mc83:1867714] [ 5] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_btl_openib.so(+0x10c0d)[0x7fb20432cc0d]
[mc83:1867714] [ 6] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libopen-pal.so.40(mca_btl_base_select+0xdd)[0x7fb20fa9608d]
[mc83:1867714] [ 7] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7fb20475bfe2]
[mc83:1867714] [ 8] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7fb210660ed4]
[mc83:1867714] [ 9] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libmpi.so.40(ompi_mpi_init+0x614)[0x7fb2106c0ae4]
[mc83:1867714] [10] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libmpi.so.40(MPI_Init+0x81)[0x7fb210645681]
[mc83:1867714] [11] ./reallysmall[0x4007ad]
[mc83:1867714] [12] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7fb210001493]
[mc83:1867714] [13] ./reallysmall[0x4006be]
[mc83:1867714] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 1867714 on node mc83 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

That said, after setting --mca pml ucx, only one of the two nodes in the job prints that stack trace on the runs that do segfault.
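
For reference, the repeat test was just a shell loop along these lines (a sketch; the exact loop wasn't shown):

for i in $(seq 1 300); do
  mpirun --mca pml ucx ./reallysmall || echo "run $i crashed"
done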

ggouaillardet commented 2 years ago

This is puzzling ... Anyway, since the crash occurs at MPI_Init() time, maybe you can avoid it by explicitly disabling the btl/openib component:

mpirun --mca pml ucx --mca btl ^openib ...
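
As a sanity check, ompi_info can confirm whether the openib BTL was built into the install at all:

ompi_info | grep btl
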
dschulzg commented 2 years ago

I decided to reevaluate the assertion that this happens at MPI_Init time. I thought it was happening there, likely because I forgot to flush the printf before the segfault and so missed the message showing where it actually happened. It is actually happening at MPI_Finalize time. I did find that putting a barrier right before the finalize made the condition much less likely, but it still happened roughly 1 in 20 times. Sorry for the confusion.
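
The workaround amounts to adding a barrier just before the finalize in the test program above:

  // Synchronizing all ranks before teardown made the crash much rarer,
  // but did not eliminate it (roughly 1 in 20 runs still failed).
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();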

The --mca btl ^openib option did fix the problem in both our existing v4.1.1 and new v4.1.3 installations of Open MPI. I'm content to just default the ^openib btl option to get this working, but if Open MPI wants to follow this further I can test things, since the only thing that's changed between working nodes and non-working nodes is slightly newer hardware, all on the same IB fabric.
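
Defaulting the option site-wide can be done through the MCA parameter file under the install prefix, e.g. (a sketch, using the prefix from the configure line above):

# /global/software/openmpi/gnu-8.4.1/4.1.3/etc/openmpi-mca-params.conf
btl = ^openib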