Open ldorau opened 2 years ago
@grom72 @osalyk @haichangsi @Patryk717
Hi @shefty, could you give a hint on how it should be fixed? For example, is FI_MR_LOCAL required in mr_mode at: https://github.com/pmem/libfabric/blob/main/fabtests/functional/rdm_stress.c#L1253
AFAIK, msg_mr can be created only by the .regv == vrb_mr_regv or .regattr == vrb_mr_regattr hooks, but neither of them is called in the rdm_stress test, so the use of FI_MR_LOCAL seems suspicious to me (or this test just lacks one of these calls).
I missed that I was tagged on this way back when.
The rdm_stress test is not coded to handle FI_MR_LOCAL correctly. At least one missing piece is in start_rpc(). After the resp buffer is allocated, the resp data needs to be registered if FI_MR_LOCAL is specified. The struct rpc_resp already has an mr field for this purpose, which is closed in complete_rpc().
I'd consider a set of changes along these lines:
static uint64_t rpc_resp_reg_flags[cmd_last] = {
	0,
	0,
	FI_SEND,
	0,
	FI_SEND,
	FI_READ,
	FI_WRITE,
};

static void start_rpc(...)
{
	...
	resp = calloc(...);
	if (need FI_MR_LOCAL && rpc_resp_reg_flags[req->cmd])
		fi_mr_reg(...);
	...
}
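To make the idea above a bit more concrete, here is a small self-contained sketch of the table-driven decision. It does not use libfabric at all: the SKETCH_* flags, struct sketch_resp, sketch_register() and the sketch_start_rpc()/sketch_complete_rpc() pair are all hypothetical stand-ins (for FI_SEND/FI_READ/FI_WRITE, struct rpc_resp, fi_mr_reg() and the real test functions), so treat it as an illustration of the control flow rather than a patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins for FI_SEND / FI_READ / FI_WRITE; the real values come from
 * the libfabric headers. Only their distinctness matters here. */
#define SKETCH_SEND  (1ULL << 0)
#define SKETCH_READ  (1ULL << 1)
#define SKETCH_WRITE (1ULL << 2)

/* Hypothetical command count, matching the seven entries proposed above. */
enum { SKETCH_CMD_LAST = 7 };

static const uint64_t sketch_reg_flags[SKETCH_CMD_LAST] = {
	0, 0, SKETCH_SEND, 0, SKETCH_SEND, SKETCH_READ, SKETCH_WRITE,
};

struct sketch_resp {
	void *buf;
	bool registered;	/* stands in for the mr field of struct rpc_resp */
	uint64_t access;	/* access flags the buffer was registered with */
};

/* Placeholder for fi_mr_reg(): only records what would be registered. */
static int sketch_register(struct sketch_resp *resp, uint64_t access)
{
	resp->registered = true;
	resp->access = access;
	return 0;
}

/* Allocate the response buffer; register it only when local MRs are
 * required and the command's table entry is non-zero. */
static int sketch_start_rpc(struct sketch_resp *resp, int cmd, size_t len,
			    bool mr_local)
{
	int ret = 0;

	assert(cmd >= 0 && cmd < SKETCH_CMD_LAST);
	resp->registered = false;
	resp->access = 0;
	resp->buf = calloc(1, len);
	if (!resp->buf)
		return -1;

	if (mr_local && sketch_reg_flags[cmd])
		ret = sketch_register(resp, sketch_reg_flags[cmd]);
	if (ret) {
		free(resp->buf);
		resp->buf = NULL;
	}
	return ret;
}

/* Counterpart of complete_rpc(): the real code would close the MR here. */
static void sketch_complete_rpc(struct sketch_resp *resp)
{
	resp->registered = false;
	free(resp->buf);
	resp->buf = NULL;
}
```

The point of keeping the flags in a table is that commands whose response buffer is never handed to the provider skip registration entirely, and the registration call gets the access rights that match how the buffer will be used.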
Describe the bug
The server of fi_rdm_stress segfaults at prov/rxm/src/rxm_msg.c:314: https://github.com/ofiwg/libfabric/blob/main/prov/rxm/src/rxm_msg.c#L314 for i == 0 because desc[i] == 0x0.
To Reproduce
Steps to reproduce the behavior:
1) Start the server:
2) Start the client:
Expected behavior
The server of fi_rdm_stress does not segfault, but runs correctly.
Output
Environment: provider: verbs
Debugging information
1) mr_mode is FI_MR_LOCAL: https://github.com/pmem/libfabric/blob/main/fabtests/functional/rdm_stress.c#L1253
so:
2) rxm_ep->rdm_mr_local is true: https://github.com/pmem/libfabric/blob/main/prov/rxm/src/rxm_ep.c#L1235
but:
3) desc[0] == NULL, because fi_send() is called with desc == NULL in handle_hello(): https://github.com/ofiwg/libfabric/blob/main/fabtests/functional/rdm_stress.c#L1006
4) It causes a segfault (NULL pointer dereference) at https://github.com/pmem/libfabric/blob/main/prov/rxm/src/rxm_msg.c#L305-L314
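The chain in steps 1) to 4) can be modelled in a few lines of plain C. This is not the rxm provider's code: struct sketch_mr, sketch_translate_desc() and SKETCH_IOV_LIMIT are hypothetical names, and the NULL check is only there to show where an unregistered buffer would otherwise be dereferenced, assuming the provider converts each caller-supplied descriptor into an inner one when local MRs are required.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_IOV_LIMIT 4	/* hypothetical upper bound on descriptors */

/* Stand-in for the provider-side MR object that a descriptor points to. */
struct sketch_mr {
	void *inner_desc;
};

/* Convert caller descriptors into inner descriptors. With mr_local set,
 * every desc[i] must point to a registered region; a NULL entry is the
 * case that crashes in the report, so it is rejected here instead. */
static int sketch_translate_desc(bool mr_local, void **desc, size_t count,
				 void **out)
{
	size_t i;

	assert(count <= SKETCH_IOV_LIMIT);
	for (i = 0; i < count; i++) {
		if (!mr_local) {
			out[i] = NULL;	/* no local registration required */
			continue;
		}
		if (!desc || !desc[i])
			return -EINVAL;	/* would be a NULL dereference */
		out[i] = ((struct sketch_mr *) desc[i])->inner_desc;
	}
	return 0;
}
```

With mr_local set and a NULL descriptor, the sketch returns -EINVAL at i == 0, which is the same position where the real server dereferences NULL.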
NOTICE
Removing FI_MR_LOCAL from mr_mode at: https://github.com/pmem/libfabric/blob/main/fabtests/functional/rdm_stress.c#L1253 makes this bug disappear; only an assertion fires in the client:
Additional information