tylerjereddy opened 3 months ago
cc @jswaro
@tylerjereddy : Hi Tyler, I missed this notification. I'll look into this.
```
===> [nid001252, 112670, 112670] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ce90
===> [nid001252, 112670, 112670] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ce90, ret=0
===> [nid001252, 112670, 112670] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ce90
===> [nid001252, 112670, 112670] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ce90, ret=0
[112670, 112725] cxip_unmap_nocache() checkpoint 1
[112670, 112725] cxip_unmap_nocache() checkpoint 2 before unregister
[112670, 112670] cxip_unmap_nocache() checkpoint 1
[112670, 112670] cxip_unmap_nocache() checkpoint 2 before unregister
[112670, 112725] cuda_gdrcopy_dev_unregister() checkpoint 1 before spin lock gdrcopy->mh=0x7e1ce90
[112670, 112725] cuda_gdrcopy_dev_unregister() checkpoint 2 after spin lock and before unmap gdrcopy->mh=0x7e1ce90
===> [nid001252, 112670, 112725] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ce90
[112670, 112670] cuda_gdrcopy_dev_unregister() checkpoint 1 before spin lock gdrcopy->mh=0x7e1ce90
===> [nid001252, 112670, 112725] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ce90
===> [nid001252, 112670, 112725] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ce90
===> [nid001252, 112670, 112725] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ce90, ret=0
[112670, 112725] cuda_gdrcopy_dev_unregister() checkpoint 3 after unmap gdrcopy->mh=0x7e1ce90
[112670, 112725] cuda_gdrcopy_dev_unregister() checkpoint 4 before unpin_buffer gdrcopy->mh=0x7e1ce90
===> [nid001252, 112670, 112725] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ce90
===> [nid001252, 112670, 112725] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ce90
===> [nid001252, 112670, 112725] GDRCopy Checkpoint _gdr_unpin_buffer: 3
[112670, 112725] cuda_gdrcopy_dev_unregister() checkpoint 4 after unpin_buffer gdrcopy->mh=0x7e1ce90
[112670, 112725] cxip_unmap_nocache() checkpoint 3 before unmap
[112670, 112670] cuda_gdrcopy_dev_unregister() checkpoint 2 after spin lock and before unmap gdrcopy->mh=(nil)
===> [nid001252, 112670, 112670] GDRCopy Checkpoint gdr_unmap: 1: mh=(nil)
[112670, 112725] cxip_unmap_nocache() checkpoint 4 after unmap
[112670, 112725] cxip_unmap_nocache() checkpoint 5 before free
[112670, 112725] cxip_unmap_nocache() checkpoint 6 after free
```
So I looked at a single-thread fault and tracked it back to two separate threads.
In the log above, what seems to have occurred is the following sequence of events:
In PID=112670, there are at least two threads: TID=112670 and TID=112725.

Crash: `mh` is NULL, which triggers the issue in the library below. Something in libfabric likely zeroed the gdrcopy struct after the operation, and the zeroing likely happened while TID=112670 was waiting to acquire the global lock.
If both threads are trying to unmap the same region, then we need one of the following solutions:

1) reference counting on the gdrcopy object, to ensure that it isn't unmapped and freed until all references are gone, or
2) a check in the code to verify that the handle is still valid; if invalid, release the lock and bail.
@tylerjereddy Nice log capture. Checkpoints helped quite a bit.
@j-xiong : Do you see this happening in general with other providers using GDRCOPY? It could be unique to CXI's RMA operation tracking, but it seems worth asking.
@jswaro I haven't seen this with other providers yet. While I can't say something similar won't happen with other providers, the sequence of events presented here is very much tied to the cxi provider.
@tylerjereddy : A couple of questions.
The CXI provider should have a 1-to-1 mapping between CXI memory region/descriptor and GDR handle. At face value, it seems like the same CXI memory region is being freed by two separate threads.
@iziemba @jswaro Do you both have access to Cray Slingshot 11 (>= 2 nodes) hardware? Probably the easiest way to debug would be to try running the cuFFTMp multi-node example per my instructions at the top of https://github.com/NVIDIA/gdrcopy/issues/296 yourselves. If you don't have access to the hardware, I was hoping LANL HPC would help interface a bit, but in the meantime it may be possible for me to do an "interactive" debug session where I set up a reproducer and run it on two nodes for you to see, add prints, etc.
The problem is ultimately that we want to be able to easily build out a toolchain that performs multi-node cuFFTMp on Cray Slingshot 11 hardware so that we can do physics simulations with, e.g., GROMACS; but this toolchain is quite problematic and often gives no complaint when component versions mismatch, apart from segfaulting.
You've got NVSHMEM and `gdrcopy` from NVIDIA, `libfabric` from OSS, and the CXI part still living on some custom branch, so this ecosystem still needs some work before there's a high chance of things "just working" when building it out. And for my trainees with little software training, it is effectively intractable.
The original report was for x86_64, but we now also want to do these builds on Grace Hopper (aarch64), so it would be most helpful if we could check buildouts on both of those architectures and have clear instructions for the builds (versions of NVSHMEM, `libfabric`, `gdrcopy`, and so on that should work together). Even better if it could happen via packaged installs, e.g. `spack install ..`, though that seems farther off for now.
Yes, either of us would have access to the hardware to reproduce this. We'll see if it can be reproduced internally following the instructions in the ticket above.
While investigating https://github.com/NVIDIA/gdrcopy/issues/296#issuecomment-2111473932 I instrumented the `libfabric` `dev-cxi` branch from @thomasgillis as follows:

In a sample 2-node cuFFTMp run on a Slingshot 11 machine, I recorded the output: out12.txt
In particular, I was looking for evidence that `libfabric` was trying to free the same memory address multiple times, one level above where it was happening in `gdrcopy`. If we check the log, for example with `grep -E -i -n "0x7e1ddb0" out12.txt`, it looks like this is indeed happening:

The `cuda_gdrcopy_dev_unregister()` function in `src/hmem_cuda_gdrcopy.c` is getting called on the same handle memory address in the same process, but from a different thread. In another trial, the spin lock in `cuda_gdrcopy_dev_unregister()` behaved slightly better, but there was still ultimately a crash from an attempt to free the unmapped address:

In `cuda_gdrcopy_dev_unregister()`, after `pthread_spin_lock(&global_gdr_lock);`, is there not a need to make sure that another thread hasn't already unmapped and unpinned the handle before proceeding? I did start messing around with this a little bit:

It still wasn't enough to get my cuFFTMp example running, but looking at the output I no longer see evidence of a second thread traversing `gdr_unmap` with the same address that was unpinned by the first thread. Is it plausible that this and other issues exist in this code, or am I way off base?