Closed: tylerjereddy closed this 1 month ago
Also, I'm pretty sure that in this case I rebuilt NVSHMEM against 1.18.1 and still had the same problem, so it isn't quite that simple for me to work around (I did change some other things as well: a newer OpenMPI, since I was previously on a release candidate of the 5.x series, and a newer NVSHMEM).
Probably the most useful question is this: do I have a shot at debugging/fixing this without basically needing to rebuild my whole dependency chain? If I could just adjust `LD_LIBRARY_PATH` to somehow fix this that would be amazing, but I suspect I won't be so lucky.
A clean rebuild of the dependency chain with libfabric at `1.18.1`, OpenMPI at `v5.0.3`, and NVSHMEM at `2.10.1` produces a similar backtrace; `pmix` is at `4.2.9`.
Changing this value from `5` to `18` in the NVSHMEM source had no effect; same errors:

```
src/modules/transport/libfabric/libfabric.h:#define NVSHMEMT_LIBFABRIC_MIN_VER 18
```

(this value gets passed to `FI_VERSION`).
I've simplified the reproducer to remove GROMACS entirely, using only the cuFFTMp example at: https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/samples/r2c_c2r_slabs_GROMACS.
Interactive run script for 2 nodes (4 A100 GPUs each)
Diff on above Makefile:
Output:
@tylerjereddy We occasionally see a similar segfault at the finalization phase inside `ucp_worker_destroy()`; the exact reason has not been identified, with the most likely guess being some race condition inside ucx.
I don't see any libfabric-related symbols in your trace. I would suggest running with debug builds of libfabric and ucx to help locate where the segfault happens.
@j-xiong I swapped in debug versions of `ucx` and `libfabric` and added the log below the fold. Also added was `FI_LOG_LEVEL=debug`. This assumes that `LD_LIBRARY_PATH` swapping is sufficient and that I don't need to rebuild/re-link things.
I'm seeing a segfault/backtrace for NVSHMEM -> `libfabric` -> `ucx` control flow in a 2-node test run of GROMACS on one of our supercomputers, with OpenMPI `5.0.2` on Cray Slingshot 11. I think what I'm really looking for is clear runtime error messages that tell me what is wrong (API, ABI, whatever version mismatches, etc.) before I ever get to a segfault. I've labelled this a bug on the sole basis that I shouldn't be able to segfault, but it could be that the error resides with, e.g., the use of `fi_getinfo()` "upstream" of the segfault happening (i.e., that NVSHMEM should handle their runtime check differently?). I've talked to NVIDIA engineers about this, and the problem really isn't clear to them. I did some experiments with runtime swapping of `libfabric` versions. NVSHMEM was built from source against `libfabric 1.20.1` from `spack`.

Here is what happens if I use `libfabric 1.18.1` at runtime instead:

```
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available
```

and then a segfault at `ucx` again. The local C++ code they're using looks like this:

where those two version variables are set to `1` and `5`, respectively. I know I've had some success in the past using libfabric `1.18.1` if I build NVSHMEM against that directly and use it at runtime. Is there a good reason I wouldn't be able to use `1.20.1`, and if so how should the NVSHMEM folks guard against it?