ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/rxm: FI_PROV_KEY failure #9133

Open thomasgillis opened 1 year ago

thomasgillis commented 1 year ago

Describe the bug: FI_MR_PROV_KEY seems to be broken with ofi_rxm.

I open a domain with an info struct that has the FI_MR_PROV_KEY mode bit set and the FI_ATOMIC capability. Then, when registering memory, I pass 0 as the requested key (it should be ignored). When using tcp/verbs;ofi_rxm, fi_mr_reg returns an error:

libfabric:2881534:1689104430::ofi_rxm:domain:rxm_mr_add_map_entry():374<warn> MR map insert for atomic verification failed -266
OFI ERROR: Required key not available

I am not sure about the exact cause of this issue. I could only notice that:

thomasgillis commented 1 year ago

When I remove FI_ATOMIC from the endpoint capability set, I am able to get rid of this error. This leads me to think that this chunk of code might be the issue:

    if (rxm_domain->util_domain.info_domain_caps & FI_ATOMIC) {
        ret = rxm_mr_add_map_entry(&rxm_domain->util_domain,
                       &msg_attr, rxm_mr);
        if (ret)
            goto map_err;
    }

Do you have any idea/input? FWIW I have FI_ORDER_NONE for the ordering arguments

shefty commented 1 year ago

The verbs provider requires FI_MR_PROV_KEY. The tcp provider does not, but it can be enabled when the domain is opened. When rxm is layered over verbs, it will pass through the MR key returned by verbs. Are you seeing failures with both verbs and tcp? Note that fi_getinfo() can clear the FI_MR_PROV_KEY bit from the mr_mode flags. So, you need to explicitly set it again if you want to force the provider to generate a key (it just increments an integer for this).

thomasgillis commented 1 year ago

@shefty I have prov_key enabled, and I will make sure it's not cleared by fi_getinfo. I am able to run with tcp, so I think it only happens when layered on top of verbs.

EDIT: it's still set after fi_getinfo. But I have just noticed that removing FI_MR_LOCAL solves the issue as well. Hopefully it helps.

nikhilnanal commented 2 months ago

Tested with rxm over verbs: rxm reports the above error either if it is out of memory or if the key already exists in its MR map. The latter suggests that the same memory registration was requested twice and the subsequent insertion into the map failed because of the existing entry. The user-requested key has no effect when FI_MR_PROV_KEY is set. The MR cache must be enabled to see the error: with the cache, registering the same memory twice returns the same key from verbs. If the MR cache is not enabled, verbs returns a unique MR key and the above error is not seen.
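
To make the failure mode described above easier to picture, here is a minimal, self-contained toy model in C. It is only a sketch under stated assumptions and is not rxm's or verbs' actual code: every `toy_*` name is hypothetical, the "provider" is a stand-in that hands out incrementing integer keys, and the "cache" remembers only the last registered buffer. The only value taken from this thread is the -266 error code shown in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAP_CAPACITY 8
/* Same numeric code as in the log above; used here for any failed insert. */
#define TOY_ERR_INSERT_FAILED (-266)

/* Hypothetical stand-in for the upper layer's key map. */
struct toy_mr_map {
    uint64_t keys[TOY_MAP_CAPACITY];
    size_t count;
};

/* Insert a key; fail if the key is already present or the map is full. */
int toy_map_insert(struct toy_mr_map *map, uint64_t key)
{
    assert(map != NULL);
    for (size_t i = 0; i < map->count; i++) {
        if (map->keys[i] == key)
            return TOY_ERR_INSERT_FAILED;   /* duplicate key */
    }
    if (map->count == TOY_MAP_CAPACITY)
        return TOY_ERR_INSERT_FAILED;       /* no room left */
    map->keys[map->count++] = key;
    return 0;
}

/* Hypothetical stand-in for the lower provider that generates the keys. */
struct toy_provider {
    int cache_enabled;
    const void *cached_buf;   /* last registered buffer, if any */
    uint64_t cached_key;
    uint64_t next_key;
};

/* With the cache on, registering the same buffer again returns the same key;
 * otherwise every registration gets a fresh, incrementing key. */
uint64_t toy_provider_reg(struct toy_provider *prov, const void *buf)
{
    assert(prov != NULL);
    if (prov->cache_enabled && prov->cached_buf == buf)
        return prov->cached_key;
    uint64_t key = prov->next_key++;
    prov->cached_buf = buf;
    prov->cached_key = key;
    return key;
}

/* Register the same buffer twice and report the result of the second
 * map insert (or of the first one, if that already failed). */
int toy_register_twice(int cache_enabled)
{
    struct toy_mr_map map = { .count = 0 };
    struct toy_provider prov = { .cache_enabled = cache_enabled, .next_key = 1 };
    char buf[64];

    int ret = toy_map_insert(&map, toy_provider_reg(&prov, buf));
    if (ret)
        return ret;
    return toy_map_insert(&map, toy_provider_reg(&prov, buf));
}
```

In this model, `toy_register_twice(1)` fails on the second insert because the cached registration returns the same key, while `toy_register_twice(0)` succeeds because each registration gets its own key. Whether the real code paths behave exactly this way is an assumption based on the description in this comment.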