Closed: tylerjereddy closed this 1 month ago
Also, I'm pretty sure that in this case I rebuilt NVSHMEM against 1.18.1 and still had the same problem, so it isn't quite that simple for me to work around (I did change some other things as well: a newer OpenMPI, since I was previously on a release candidate of the 5.x series, and a newer NVSHMEM).
Probably the most useful question is this: do I have a shot at debugging/fixing this without basically needing to rebuild my whole dependency chain? If I could just adjust `LD_LIBRARY_PATH` to somehow fix this that would be amazing, but I suspect I won't be so lucky.
A clean rebuild of the dependency chain with libfabric at `1.18.1`, OpenMPI at `v5.0.3`, and NVSHMEM at `2.10.1` produces a similar backtrace; `pmix` is at `4.2.9`.
Changing this value from `5` to `18` in the NVSHMEM source had no effect; same errors:

```
src/modules/transport/libfabric/libfabric.h:#define NVSHMEMT_LIBFABRIC_MIN_VER 18
```

(this value gets passed to `FI_VERSION`).
I've simplified the reproducer to remove GROMACS entirely, using only the cuFFTMp example at: https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/samples/r2c_c2r_slabs_GROMACS.
Interactive run script for 2 nodes (4 A100 GPUs each)
Diff on above Makefile:
Output:
@tylerjereddy We occasionally see a similar segfault at the finalization phase inside `ucp_worker_destroy()`; the exact reason has not been identified, with the most likely guess being some race condition inside ucx.
I don't see any libfabric-related symbols in your trace. I would suggest running with debug builds of libfabric and ucx to help locate where the segfault happens.
@j-xiong I swapped in debug versions of `ucx` and `libfabric` and added the log below the fold. Also added was `FI_LOG_LEVEL=debug`. This assumes that `LD_LIBRARY_PATH` swapping is sufficient and that I don't need to rebuild/re-link things.
I'm seeing a segfault/backtrace for NVSHMEM -> `libfabric` -> `ucx` control flow in a 2-node test run of GROMACS on one of our supercomputers, with OpenMPI `5.0.2` on Cray Slingshot 11. I think what I'm really looking for is clear runtime error messages that tell me what is wrong (API, ABI, whatever version mismatches, etc.) before I ever get to a segfault. I've labelled this a bug on the sole basis that I shouldn't be able to segfault, but it could be that the error resides with, e.g., the use of `fi_getinfo()` "upstream" of the segfault happening (i.e., that NVSHMEM should handle their runtime check differently?). I've talked to NVIDIA engineers about this, and the problem really isn't clear to them. I did some experiments with runtime swapping of `libfabric` versions. NVSHMEM was built from source against `libfabric 1.20.1` from `spack`.

Here is what happens if I use `libfabric 1.18.1` at runtime instead:

```
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:1524: non-zero status: -61 No providers matched fi_getinfo query: -61: No data available
```

and then a segfault at `ucx` again. The local C++ code they're using looks like this:

where those two version variables are set to `1` and `5`, respectively. I know I've had some success in the past using libfabric `1.18.1` if I build NVSHMEM against that directly and use it at runtime. Is there a good reason I wouldn't be able to use `1.20.1`, and if so how should the NVSHMEM folks guard against it?