open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

segfault when trying to open significantly too many contexts #10370

Open joshfisher-cornelisnetworks opened 2 years ago

joshfisher-cornelisnetworks commented 2 years ago


Background information

Found this issue on a two-HFI system where one HFI was disabled, which caused the command to try to open far too many contexts for the single remaining HFI.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

No output

Please describe the system on which you are running


Details of the problem

On a two-HFI system with one HFI disabled, I ran a test that would work with both HFIs enabled. I expected a failure due to too many contexts, but got a segfault instead of a graceful abort. When running with -np and ppr values closer to the limit, but still over it, the abort is more graceful.

Command run: openmpi-v4.1.2/bin/mpirun -np 192 --map-by ppr:96:node -host hostA:96,hostB:96 --bind-to core --display-map --tag-output --allow-run-as-root --mca mtl ofi --mca btl ofi -x LD_LIBRARY_PATH=path/to/opx/build -x FI_PROVIDER=opx FI_LOG_LEVEL=warn -x IMB-MPI1 -include Uniband,Biband -npmin 192 -iter 10000 -msglog 0:15

Backtrace found:

#0 0x00007fd9e35eb6e8 in mca_btl_ofi_context_finalize () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_btl_ofi.so
#1 0x00007fd9e35ebab9 in mca_btl_ofi_context_alloc_scalable () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_btl_ofi.so
#2 0x00007fd9e35e7f9f in mca_btl_ofi_component_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_btl_ofi.so
#3 0x00007fd9f3a74d16 in mca_btl_base_select () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libopen-pal.so.40
#4 0x00007fd9e37f2441 in mca_bml_r2_component_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_bml_r2.so
#5 0x00007fd9f4e5f3ce in mca_bml_base_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libmpi.so.40
#6 0x00007fd9f4e9d4fd in ompi_mpi_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libmpi.so.40
#7 0x00007fd9f4e46875 in PMPI_Init_thread () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libmpi.so.40
#8 0x0000000000405265 in main (argc=9, argv=0x7ffdfc1a2938) at imb.cpp:295
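
The backtrace shows mca_btl_ofi_context_finalize being called from mca_btl_ofi_context_alloc_scalable, which suggests the crash happens in the cleanup path after an allocation failure. The thread does not state the actual root cause, so the following is only a minimal sketch of one way such a cleanup path can be made safe. Every name in it (demo_context_t, demo_contexts_alloc, demo_contexts_finalize) is hypothetical and is not taken from the Open MPI sources.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-context resource set.
 * This is NOT the real OFI BTL context layout. */
typedef struct {
    int *tx; /* stands in for a transmit-side resource */
    int *rx; /* stands in for a receive-side resource */
} demo_context_t;

/* Release `count` contexts. Safe to call with a NULL array and with
 * entries that were never filled in, because free(NULL) is a no-op. */
void demo_contexts_finalize(demo_context_t *ctxs, size_t count)
{
    if (NULL == ctxs) {
        return;
    }
    for (size_t i = 0; i < count; ++i) {
        free(ctxs[i].tx);
        free(ctxs[i].rx);
    }
    free(ctxs);
}

/* Try to create `wanted` contexts when the (assumed) device limit is
 * `available`. Returns the array on success, or NULL after cleaning up. */
demo_context_t *demo_contexts_alloc(size_t wanted, size_t available)
{
    if (0 == wanted) {
        return NULL;
    }

    /* calloc zero-fills the array, so untouched entries hold NULL. */
    demo_context_t *ctxs = calloc(wanted, sizeof(*ctxs));
    if (NULL == ctxs) {
        return NULL;
    }

    for (size_t i = 0; i < wanted; ++i) {
        if (i >= available) {
            fprintf(stderr, "demo: requested %zu contexts, only %zu available\n",
                    wanted, available);
            demo_contexts_finalize(ctxs, wanted);
            return NULL;
        }
        ctxs[i].tx = malloc(sizeof(int));
        ctxs[i].rx = malloc(sizeof(int));
        if (NULL == ctxs[i].tx || NULL == ctxs[i].rx) {
            demo_contexts_finalize(ctxs, wanted);
            return NULL;
        }
    }
    return ctxs;
}
```

With this shape, a call such as demo_contexts_alloc(96, 1) reports the shortfall and returns NULL, so the caller can abort cleanly instead of dereferencing half-built state.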

jsquyres commented 2 years ago

@joshfisher-cornelisnetworks Did you intend for this to be a self-assigned issue? I.e., if it's an HFI issue, that's a Cornelis issue, which is you, right?

mwheinz commented 2 years ago

Josh, did you mean to open a Jira?

joshfisher-cornelisnetworks commented 2 years ago

No, the fault consistently occurs in the OMPI part of the code, and everything we have been able to find points to this being an OMPI issue. We expect a failure, but at some point OMPI segfaults instead of aborting gracefully. Let me double-check with the people I have been working with on this issue, but when we last talked, we decided it warranted an OMPI bug.

mwheinz commented 2 years ago

So, as a point of history, the OFI BTL was originally written by Intel as part of the OmniPath project. If the failure is in the OFI BTL, it might be Cornelis’ responsibility now. It’s not clear. You might want to reach out to Sean Hefty; check OFIWG/libfabric for how to contact him.

jsquyres commented 2 years ago

Sean Hefty won't have much of a clue on Open MPI code -- he's more the libfabric guy than the Open MPI guy. Regardless, unless there's a non-HFI reproducer, I don't know if anyone else in the Open MPI community can work on this, because no one else will have HFI hardware.

joshfisher-cornelisnetworks commented 2 years ago

I have a pull request for the issue that caused this segfault, targeted at v4.1.x.