dschulzg opened this issue 2 years ago
@dschulzg the logs strongly suggest UCX is not being used and Open MPI is instead using the legacy btl/openib
component.
You can mpirun --mca pml ucx ...
in order to force Open MPI to use UCX, or make it abort if UCX is not available.
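You can also raise the PML framework verbosity to confirm which PML was actually selected; something along these lines (the node names and process count are placeholders, not taken from your setup):
mpirun --mca pml ucx --mca pml_base_verbose 10 -np 2 -host node1,node2 ./a.out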
I forced UCX as you said, but it still segfaults, though only sometimes. I ran the test program in a loop 300 times and it crashed 8 times. The stack dump looks the same to me:
[mc83:1867714] *** Process received signal ***
[mc83:1867714] Signal: Segmentation fault (11)
[mc83:1867714] Signal code: Address not mapped (1)
[mc83:1867714] Failing at address: 0x7fb210ae70e0
[mc83:1867714] [ 0] /lib64/libpthread.so.0(+0x12c20)[0x7fb2103b5c20]
[mc83:1867714] [ 1] /lib64/libibverbs.so.1(+0xc7f5)[0x7fb1ff9d57f5]
[mc83:1867714] [ 2] /lib64/libibverbs.so.1(ibv_create_comp_channel+0x5a)[0x7fb1ff9e0caa]
[mc83:1867714] [ 3] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_btl_openib.so(+0x291b1)[0x7fb2043451b1]
[mc83:1867714] [ 4] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_btl_openib.so(opal_btl_openib_connect_base_select_for_local_port+0x112)[0x7fb20433ee82]
[mc83:1867714] [ 5] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_btl_openib.so(+0x10c0d)[0x7fb20432cc0d]
[mc83:1867714] [ 6] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libopen-pal.so.40(mca_btl_base_select+0xdd)[0x7fb20fa9608d]
[mc83:1867714] [ 7] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7fb20475bfe2]
[mc83:1867714] [ 8] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7fb210660ed4]
[mc83:1867714] [ 9] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libmpi.so.40(ompi_mpi_init+0x614)[0x7fb2106c0ae4]
[mc83:1867714] [10] /global/software/openmpi/gnu-8.4.1/4.1.3/lib/libmpi.so.40(MPI_Init+0x81)[0x7fb210645681]
[mc83:1867714] [11] ./reallysmall[0x4007ad]
[mc83:1867714] [12] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7fb210001493]
[mc83:1867714] [13] ./reallysmall[0x4006be]
[mc83:1867714] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 1867714 on node mc83 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
That being said, after setting --mca pml ucx it only outputs that stack trace from one of the 2 nodes in the job on the runs that do segfault.
This is puzzling ...
Anyway, since the crash occurs at MPI_Init() time, maybe you can avoid it by explicitly disabling the btl/openib component:
mpirun --mca pml ucx --mca btl ^openib ...
I decided to reevaluate the assertion that this happens at MPI_Init time. I think I believed it happened there because I forgot to flush the printf before the segfault and missed the message showing where it actually occurred. It is actually happening at MPI_Finalize time. I did find that putting a barrier right before the finalize made the condition much less likely, but it still happened in roughly 1 in 20 runs. Sorry for the confusion.
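For reference, here is a sketch of the kind of test I'm running, with the extra fflush and barrier added (an illustrative stand-in, not the exact reallysmall source):
/* Illustrative stand-in for the small test program, with the extra
 * fflush() and MPI_Barrier() described above.
 * Build: mpicc reallysmall.c -o reallysmall */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = -1, size = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("rank %d of %d: init done\n", rank, size);
    fflush(stdout);                 /* flush so messages are not lost if we crash later */

    MPI_Barrier(MPI_COMM_WORLD);    /* the barrier that made the crash much less frequent */

    printf("rank %d: calling MPI_Finalize\n", rank);
    fflush(stdout);                 /* confirms the segfault happens inside MPI_Finalize */

    MPI_Finalize();
    return 0;
}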
The --mca btl ^openib option did fix the problem in both our existing v4.1.1 and new v4.1.3 installations of Open MPI. I'm content just to make the ^openib btl option the default to get this working, but if the Open MPI developers want to follow this further I can test things, because the only difference between working nodes and non-working nodes is slightly newer hardware -- all on the same IB fabric.
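For reference, one way to make that the default (a sketch, assuming a shared install prefix) is via Open MPI's MCA parameter file or an environment variable:
# <openmpi-prefix>/etc/openmpi-mca-params.conf  (or ~/.openmpi/mca-params.conf)
btl = ^openib
# or per job script / shell:
export OMPI_MCA_btl=^openib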
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
The problem happens with v4.1.3 and v4.1.1 (noticed with v4.1.1, then upgraded to v4.1.3 to see if that fixed it).
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From source tarball
(I tried both --with and --without for each of verbs, ucx, and psm2.)
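For illustration, one of the combinations looked roughly like the following (the prefix is taken from our install path; the UCX path is a placeholder and this is not the exact configure line used):
./configure --prefix=/global/software/openmpi/gnu-8.4.1/4.1.3 \
    --with-ucx=/path/to/ucx --without-verbs --without-psm2
make -j && make install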
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
Please describe the system on which you are running
The biggest difference between working nodes and broken ones is the CPUs. Here is a snippet of a diff of the /proc/cpuinfo flags: the flags on the left of the <> are from the slightly older Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz (in Dell 6420s) and the flags in the right column are from the newer Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz (in Dell c6520s -- note the slightly different chassis model).
Kernel 4.18.0-348.12.2.el8_5.x86_64
Details of the problem
Running this program gives the following output:
I've tried with and without installing Mellanox OFED (vs. Rocky's distributed one), upgraded UCX from 1.9 to 1.12, and tried the newest glibc and libibverbs from Rocky 8.5's repos.
Also tried new firmware on the IB HCA.
I'm out of things to try upgrading at this point. Any thoughts?
Thanks -Dave