joshfisher-cornelisnetworks opened 2 years ago
@joshfisher-cornelisnetworks Did you intend for this to be a self-assigned issue? I.e., if it's an HFI issue, that's a Cornelis issue, which is you, right?
Josh, did you mean to open a Jira?
No, the fault is consistently happening in the OMPI part of the code, and everything we have found points to this being an OMPI issue. We expect a failure here, but at some point OMPI segfaults instead of failing more gracefully. Let me double-check with the people I have been working with on this, but as of our last discussion we decided it was worth filing as an OMPI bug.
So, as a point of history, the OFI BTL was originally written by Intel as part of the OmniPath project. If the failure is in the OFI BTL, it might be Cornelis' responsibility now; it's not clear. You might want to reach out to Sean Hefty — see OFIWG/libfabric for how to contact him.
Sean Hefty won't have much of a clue on Open MPI code -- he's more the libfabric guy than the Open MPI guy. Regardless, unless there's a non-HFI reproducer, I don't know if anyone else in the Open MPI community can work on this, because no one else will have HFI hardware.
I have a pull request for the issue that caused this segfault, targeted at v4.1.x.
Thank you for taking the time to submit an issue!
Background information
Found this issue on a system with 2 HFIs where 1 was disabled, causing the command to open far too many contexts for the single remaining HFI.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
No output
Please describe the system on which you are running
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:
On a 2 HFI system with 1 HFI disabled, we ran a test that would normally work with both HFIs enabled. We expected a failure due to too many contexts, but got a segfault instead of a graceful abort. We found that when running with np and ppr values closer to (but still over) the limit, the abort is graceful.
Command run:

```shell
openmpi-v4.1.2/bin/mpirun -np 192 --map-by ppr:96:node -host hostA:96,hostB:96 \
    --bind-to core --display-map --tag-output --allow-run-as-root \
    --mca mtl ofi --mca btl ofi \
    -x LD_LIBRARY_PATH=path/to/opx/build -x FI_PROVIDER=opx -x FI_LOG_LEVEL=warn \
    IMB-MPI1 -include Uniband,Biband -npmin 192 -iter 10000 -msglog 0:15
```
Backtrace found: