Open thomasgillis opened 1 year ago
When removing FI_ATOMIC from the endpoint cap set, I am able to get rid of this error.
So it leads me to think that this chunk of code might be the issue:
if (rxm_domain->util_domain.info_domain_caps & FI_ATOMIC) {
    ret = rxm_mr_add_map_entry(&rxm_domain->util_domain,
                               &msg_attr, rxm_mr);
    if (ret)
        goto map_err;
}
Do you have any idea/input?
FWIW I have FI_ORDER_NONE for the ordering arguments.
The verbs provider requires FI_MR_PROV_KEY. The tcp provider does not, but it can be enabled when the domain is opened. When rxm is layered over verbs, it will pass through the MR key returned by verbs. Are you seeing failures with both verbs and tcp? Note that fi_getinfo() can clear the FI_MR_PROV_KEY bit from the mr_mode flags. So, you need to explicitly reset it if you want to force the provider to generate a key (it just increments an integer for this).
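To make the "explicitly reset it" step concrete, here is a minimal, self-contained sketch. It does not call libfabric: `demo_getinfo`, `demo_force_prov_key`, `struct demo_domain_attr`, and the `DEMO_*` bit values are hypothetical stand-ins, and the model of fi_getinfo() (keep only the bits the provider requires) is an assumption made for illustration. In a real application the equivalent step would be OR-ing FI_MR_PROV_KEY back into the returned `domain_attr->mr_mode` before opening the domain.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for mr_mode bits. The real flags are FI_MR_LOCAL
 * and FI_MR_PROV_KEY; the values below are NOT the libfabric values. */
#define DEMO_MR_LOCAL    (1u << 0)
#define DEMO_MR_PROV_KEY (1u << 1)

/* Hypothetical stand-in for the domain attributes carried by an info struct. */
struct demo_domain_attr {
    uint32_t mr_mode;
};

/* Assumed model of fi_getinfo(): the returned mr_mode keeps only the bits
 * that the provider actually requires, so optional bits may be cleared. */
void demo_getinfo(const struct demo_domain_attr *hints,
                  uint32_t provider_required,
                  struct demo_domain_attr *out)
{
    out->mr_mode = hints->mr_mode & provider_required;
}

/* Re-assert the provider-key bit on the returned mode, leaving all other
 * bits untouched. */
uint32_t demo_force_prov_key(uint32_t mr_mode)
{
    return mr_mode | DEMO_MR_PROV_KEY;
}
```

A caller would check the bit on the returned attributes and, if it was cleared, apply `demo_force_prov_key` before using them to open the domain.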
@shefty I have prov_key enabled. I will make sure it's not cleared by fi_getinfo.
I am able to run with tcp, so I think it's only when layered on top of verbs.
EDIT: it's still on after the fi_getinfo call.
But I have just noticed that removing FI_MR_LOCAL solves the issue as well. Hopefully it helps.
Tested with rxm over verbs: rxm reports the above error either if it is out of memory or if the key already exists in its MR map. The latter suggests that the same memory registration was requested twice and the second insertion into the map failed because an entry already existed. The user-requested key has no effect when FI_MR_PROV_KEY is set. The MR cache must be enabled to see the error: with the cache on, registering the same memory twice returns the same key from verbs. If the MR cache is not enabled, verbs returns a unique MR key each time and the above error is not seen.
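The failure mode described in this comment can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the rxm or verbs implementation: `struct demo_mr_map`, `struct demo_verbs`, `demo_map_insert`, `demo_verbs_reg`, and the `DEMO_*` error codes are all hypothetical names, and the "cache" remembers a single buffer for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAP_CAP 8
#define DEMO_ENOKEY  (-1)   /* hypothetical code: key already in the map */
#define DEMO_ENOMEM  (-2)   /* hypothetical code: map is full */

/* Toy stand-in for the upper layer's key map. */
struct demo_mr_map {
    uint64_t keys[DEMO_MAP_CAP];
    size_t   count;
};

/* Insert a key; fail if it is already present or the map is full. */
int demo_map_insert(struct demo_mr_map *map, uint64_t key)
{
    for (size_t i = 0; i < map->count; i++) {
        if (map->keys[i] == key)
            return DEMO_ENOKEY;
    }
    if (map->count == DEMO_MAP_CAP)
        return DEMO_ENOMEM;
    map->keys[map->count++] = key;
    return 0;
}

/* Toy stand-in for the lower provider with an optional one-entry cache. */
struct demo_verbs {
    int         cache_enabled;
    uint64_t    next_key;
    const void *cached_buf;
    uint64_t    cached_key;
};

/* "Register" a buffer: a cache hit returns the previously issued key,
 * otherwise a fresh key is generated. */
uint64_t demo_verbs_reg(struct demo_verbs *v, const void *buf)
{
    if (v->cache_enabled && v->cached_buf == buf)
        return v->cached_key;

    uint64_t key = v->next_key++;
    if (v->cache_enabled) {
        v->cached_buf = buf;
        v->cached_key = key;
    }
    return key;
}
```

With the cache enabled, registering the same buffer twice yields the same key, so the second `demo_map_insert` fails with `DEMO_ENOKEY`. With the cache disabled, each registration gets a fresh key and both insertions succeed.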
Describe the bug
FI_PROV_KEY seems to be broken with ofi_rxm.
I open a domain with an info struct with FI_PROV_KEY mode on and the FI_ATOMIC cap. Then, when registering memory, I use a value of 0 for the requested key (it should be ignored). When using tcp/verbs;ofi_rxm, the fi_mr_reg function returns with an error:
I am not sure about the exact reason for this issue; I could only notice that:
FI_PROV_KEY value here