HG_Respond with extra buf leads to libfabric running out of recv entries

Describe the bug

Calling HG_Respond with an extra buffer can cause libfabric to run out of recv_entry in ofi_rxm.

I found this while testing the DAOS project. The sequence is: the client RPC times out and the client calls HG_Cancel; the server then calls HG_Respond with an extra buffer. Because the client has already canceled the RPC, it does not handle the response, so no acknowledgment is ever sent back to the server.

The server posts a recv_expected to wait for the client ack, which consumes a recv_entry in ofi_rxm. Since the ack never arrives, that recv_entry is never released. Eventually recv_entry runs out, and this hg_context is unable to post any more receives.

The expected behavior would be for HG/NA to drop the recv_expected (the one waiting for the ack) so that the recv_entry is freed. One option might be to add a timeout mechanism.

In the meantime, we are adding a timeout detection mechanism to crt_reply to prevent this from happening.
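To make the timeout idea concrete, here is a minimal, self-contained sketch in C. It is not Mercury, NA, or libfabric code: `ack_pool`, `ack_pool_post`, `ack_pool_complete`, `ack_pool_reap`, and `POOL_SIZE` are all hypothetical names made up for illustration, and the fixed-size array only stands in for the limited number of recv_entry available. It shows each ack wait carrying a deadline and a periodic reap pass releasing the waits whose deadline has passed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit; stands in for the finite number of recv_entry. */
#define POOL_SIZE 4

/* One outstanding "wait for client ack" (hypothetical bookkeeping). */
struct pending_ack {
    bool     in_use;
    uint64_t deadline_ms; /* absolute time after which the wait is dropped */
};

struct ack_pool {
    struct pending_ack slots[POOL_SIZE];
};

/* Reserve a slot for an ack wait.
 * Returns the slot index, or -1 when every slot is already taken. */
static int ack_pool_post(struct ack_pool *pool, uint64_t now_ms,
                         uint64_t timeout_ms)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!pool->slots[i].in_use) {
            pool->slots[i].in_use = true;
            pool->slots[i].deadline_ms = now_ms + timeout_ms;
            return (int) i;
        }
    }
    return -1;
}

/* Normal path: the ack arrived, so release the slot. */
static void ack_pool_complete(struct ack_pool *pool, int slot)
{
    assert(slot >= 0 && slot < POOL_SIZE);
    pool->slots[slot].in_use = false;
}

/* Timeout path: release every wait whose deadline has passed.
 * Returns the number of slots freed. A real fix would also have to
 * cancel the posted receive here; this sketch does not model that. */
static size_t ack_pool_reap(struct ack_pool *pool, uint64_t now_ms)
{
    size_t reaped = 0;

    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (pool->slots[i].in_use && now_ms >= pool->slots[i].deadline_ms) {
            pool->slots[i].in_use = false;
            reaped++;
        }
    }
    return reaped;
}
```

Without the reap pass, `POOL_SIZE` responses whose acks never arrive are enough to make every later `ack_pool_post` fail, which mirrors the exhaustion described above. The timeout value and how often to run the reap pass are open choices that this sketch does not try to settle.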

mercury-hpc/mercury issue #735