thomas-bouvier closed this 2 months ago
All 3 use cases should be supported as far as I know. The error indicates a memory registration issue, not a transfer issue. Can you try again with the MR cache monitor turned off (FI_MR_CACHE_MONITOR=disabled), or with the CUDA cache monitor turned on (FI_MR_CUDA_CACHE_MONITOR_ENABLED=1)? Also, have you verified that your CUDA device ID is 0?
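For reference, the two suggestions above can be applied per run as environment variables; the client binary name below is a placeholder:

```shell
# Option 1: disable the memory-registration cache monitor entirely.
FI_MR_CACHE_MONITOR=disabled ./my_thallium_client

# Option 2: keep the MR cache but enable the CUDA-aware cache monitor.
FI_MR_CUDA_CACHE_MONITOR_ENABLED=1 ./my_thallium_client
```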
Thank you for your answer!
Unfortunately, disabling the MR cache with FI_MR_CACHE_MAX_COUNT=0 didn't change anything. The issue seems to be caused by something else.
gemini_mr_disabled.txt
I didn't spot any major difference with FI_MR_CUDA_CACHE_MONITOR_ENABLED=1 either.
gemini_cuda_monitor_enabled.txt
I don't really understand what the device ID refers to. There are 8 GPUs on the DGX-1 cluster I'm using, and ranks are in [0-7], so I guess 0 should work?
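If it helps, one way to check the 0-based CUDA device IDs (assuming nvidia-smi is available on the node) is:

```shell
# Enumerate the GPUs visible on the node; on an 8-GPU DGX-1 this
# should list devices with IDs 0 through 7.
nvidia-smi -L

# Optionally restrict the process to a single physical GPU, which
# then appears as device 0 inside the application.
# (Binary name is a placeholder.)
CUDA_VISIBLE_DEVICES=0 ./my_reproducer
```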
I ran my reproducer on another machine where it works (Theta). I'm attaching the logs below. The first line that differs from the gemini logs above is L2043, where cuda_mm_subscribe() is called. This results in the following: Assigned CUDA buffer ID 26632 to buffer 0x7ff144400000. The corresponding line on gemini (my non-working setup) is L721, and from there I don't see any buffer ID being assigned anywhere.
theta.txt
The mystery remains...
Closing for now, please re-open the libfabric issue if needed.
Hello :)
Describe the bug
I'm trying to use RDMA to transfer a remote CPU variable into a local variable living in CUDA memory. First of all, is that use case supported? More generally, are the following scenarios supported:
If the latter scenario is not supported, then this issue is irrelevant.
This is the error I'm getting:
I initialized Mercury with device memory support, and MOFED is installed on the machines I'm using. I've tested on a DGX-1 cluster (part of the Grid'5000 testbed) and on a node on Cooley: both experiments yield the same error.
To Reproduce
This example is using the Thallium API. I can try to rewrite it if needed.
The remote variable is an array of increasing integers stored on the CPU. The local variable is an array of the same size containing zeros and stored in CUDA memory. At the end of the program, I expect devArray to contain {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. I'm eventually moving devArray to the CPU for the purpose of printing it (the hostArray variable). The program doesn't reach that line though, throwing the HG_FAULT before that.
Platform (please complete the following information):
Additional context
Here are some additional logs with FI_LOG_LEVEL=debug HG_LOG_LEVEL=debug HG_SUBSYS_LOG=na:
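A typical invocation capturing these logs to a file might look like this (the binary and output file names are placeholders):

```shell
# Mercury/NA and libfabric write their debug output to stderr,
# so redirect stream 2 to keep the log.
FI_LOG_LEVEL=debug HG_LOG_LEVEL=debug HG_SUBSYS_LOG=na \
    ./my_reproducer 2> debug_log.txt
```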
DGX-1 cluster gemini.txt
Cooley cooley.txt
Please note that I also noticed these lines on Cooley:
Thank you!