Hello :)
Describe the bug
For additional context, please see that issue.
I'm using libfabric through mercury. I'm trying to transfer a remote CPU variable into a local CUDA variable using provider `verbs`. This is failing on one of the systems I'm using.
To Reproduce
My reproducer (leveraging the Thallium API) leads to the issue on one machine, and works perfectly on another one. There should be no issue related to the code itself, and I guess the logs below are more insightful.
I suspect the issue might be caused by something not configured properly somewhere else on my system, or a missing driver. MOFED is installed.
Output
I ran my reproducer on two machines: output 1 (`gemini`) is incorrect (has the issue), output 2 (`theta`) is correct (the reproducer runs fine).
gemini.txt theta.txt
`gemini` around L721 (has the issue):
`theta` around L2043 (working fine):
What could cause a buffer ID not to be assigned by `cuda_mm_subscribe()` in the `gemini` output?
I tried to disable the MR cache with `FI_MR_CACHE_MAX_COUNT=0`, but it didn't change anything. The issue seems to be caused by something else.
gemini_mr_disabled.txt
Environment:
MOFED drivers are installed on both systems.
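Since a missing driver is one of the suspicions above, one way to compare the two machines is to print the same short report on each and diff the results. The following is only a sketch under assumptions: the module names `nvidia_peermem`, `nv_peer_mem` and `gdrdrv` are kernel modules commonly involved in GPUDirect RDMA and are my guess at what could differ, not something the logs confirm. The helper names are hypothetical.

```python
import os

# Assumption: these GPUDirect-related kernel modules are the kind of
# "missing driver" that could differ between the two machines.
PEER_MEMORY_MODULES = ("nvidia_peermem", "nv_peer_mem", "gdrdrv")


def loaded_peer_memory_modules(proc_modules_text):
    """Return the peer-memory modules found in /proc/modules-style text."""
    # Each non-empty line of /proc/modules starts with the module name.
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in PEER_MEMORY_MODULES if name in loaded]


def libfabric_settings(environ):
    """Return the FI_* variables (libfabric runtime settings) from a mapping."""
    return {
        key: value
        for key, value in sorted(environ.items())
        if key.startswith("FI_")
    }


def report(proc_modules_path="/proc/modules"):
    """Build a short text report meant to be run on both machines and diffed."""
    try:
        with open(proc_modules_path) as handle:
            modules = loaded_peer_memory_modules(handle.read())
    except OSError:
        modules = None  # file not readable, e.g. not a Linux host

    if modules is None:
        module_summary = "unknown"
    else:
        module_summary = ", ".join(modules) or "none"

    lines = ["peer-memory modules: " + module_summary]
    for key, value in libfabric_settings(os.environ).items():
        lines.append(key + "=" + value)
    return "\n".join(lines)


if __name__ == "__main__":
    print(report())
```

Running it with `FI_MR_CACHE_MAX_COUNT=0` exported also confirms that the setting is visible in the environment the reproducer is launched from.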
Thanks!