Describe the bug
HG_Respond with an extra buffer leads libfabric to run out of rxm recv_entry objects
While testing the DAOS project, I found a bug: if the server calls HG_Respond with an extra buffer after the client's RPC has already timed out and the client has called HG_Cancel, the client never processes the response, so no acknowledgment is sent back to the server.
For each such response, the server posts a recv_expected to wait for the client's ack, which consumes a recv_entry in ofi_rxm. These entries are never released, so recv_entry eventually runs out and the hg_context can no longer post any receives.
The expected behavior would be for HG/NA to drop the recv_expected (waiting for the ack) and free the recv_entry, for example via a timeout mechanism.
//
We are currently adding a timeout-detection mechanism to crt_reply to prevent this from happening.