Closed rhalkyard closed 6 years ago
Addendum: the segfault occurred on a CLE 6.0 system. Running the same code on CLE 5.2, I get no segfault, but fi_endpoint()
returns 18446744073709551600, which doesn't quite seem right to me either.
Maybe we'll get lucky and ofiwg/libfabric#3469 will help. I'll check that out. Thanks for reporting and providing a test case.
I don't see this segfault on cori. Are you using the upstream libfabric or the one from this github repo? Also, which configure options are you using?
Interesting. This is on crystal here at Cray, so it's possible there might be some discrepancy in the underlying software. For the record, my configure options are --enable-gni=yes --enable-debug cross_compiling=yes, and I seem to get the segfault on both this libfabric and the upstream version.
Looks like we did indeed get lucky with ofiwg/libfabric#3469. Applied that patch to this repo's libfabric, and it seems that the segfault goes away. I think we can consider this closed once that PR gets merged.
@rhalkyard can you verify with upstream master that this problem is fixed?
The gni provider in upstream master (and here) seems to be broken right now – I get a compile error while building. However, if I check out a few commits back (b86c958a4b615239ff7258c3a601b727455534a4 seems to be the issue) I no longer see the issue when running my reproducer under the same conditions.
If I set a basic set of hints to select the gni provider, and set caps = FI_MSG without FI_RMA, I get a segfault when calling fi_endpoint(), with the following stack trace.
Adding FI_RMA to caps works around the issue, but as far as I can tell, it should be entirely valid to only request FI_MSG.
git bisecting on libfabric points the finger at 9316e133 (in particular, the MR mode getting changed in fi_alter_domain_attr()), but I'm not sure whether that change in behavior is the root cause, or whether it's just revealing an underlying provider issue, so I thought I would report it here first – the issue does not show up under the verbs or sockets providers.
A minimal reproducer is below: