mercury-hpc / mercury

Mercury is a C library for implementing RPC, optimized for HPC.
http://www.mcs.anl.gov/projects/mercury/
BSD 3-Clause "New" or "Revised" License

HG_Respond with extra buffer leads to libfabric running out of recv entries #735

Open hsx6876 opened 1 month ago

hsx6876 commented 1 month ago

Describe the bug: HG_Respond with an extra buffer causes libfabric to run out of rxm recv_entry slots.

While testing the DAOS project, I found the following bug. If a client RPC times out and the client calls HG_Cancel, and the server then calls HG_Respond with an extra buffer, the client does not handle the response, so no acknowledgment is ever sent back to the server.

The server posts a recv_expected to wait for the client ack, which consumes a recv_entry in ofi_rxm. Because the ack never arrives, that recv_entry is never released. Eventually the recv_entry pool is exhausted, and this hg_context can no longer post any receives.

The expected behavior is for HG/NA to drop the recv_expected that is waiting for the ack, so that its recv_entry is freed. A timeout mechanism might be one way to do this.
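To make the suggestion concrete, here is a minimal, self-contained sketch of the idea: a fixed-size pool of ack-wait slots where each slot carries a deadline and can be reclaimed once that deadline passes. This is not Mercury, NA, or libfabric code. Every name in it (`ack_pool`, `ack_pool_post`, `ack_pool_expire`, `ACK_POOL_SIZE`) is hypothetical, and a real fix would presumably also have to cancel the posted receive in the transport, which is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool size; the real limit is set by the provider. */
#define ACK_POOL_SIZE 4

/* One slot standing in for a posted "wait for client ack" receive. */
struct ack_slot {
    bool in_use;
    uint64_t deadline_ms; /* absolute time after which the wait is abandoned */
};

struct ack_pool {
    struct ack_slot slots[ACK_POOL_SIZE];
};

static void ack_pool_init(struct ack_pool *pool)
{
    assert(pool != NULL);
    for (size_t i = 0; i < ACK_POOL_SIZE; i++) {
        pool->slots[i].in_use = false;
        pool->slots[i].deadline_ms = 0;
    }
}

/* Reserve a slot for an ack wait. Returns the slot index, or -1 when the
 * pool is exhausted (the failure mode described in this issue). */
static int ack_pool_post(struct ack_pool *pool, uint64_t now_ms,
                         uint64_t timeout_ms)
{
    assert(pool != NULL);
    for (size_t i = 0; i < ACK_POOL_SIZE; i++) {
        if (!pool->slots[i].in_use) {
            pool->slots[i].in_use = true;
            pool->slots[i].deadline_ms = now_ms + timeout_ms;
            return (int)i;
        }
    }
    return -1;
}

/* The ack arrived: release the slot normally. */
static void ack_pool_complete(struct ack_pool *pool, int idx)
{
    assert(pool != NULL && idx >= 0 && idx < ACK_POOL_SIZE);
    pool->slots[idx].in_use = false;
}

/* Drop every wait whose deadline has passed. Returns the number of slots
 * reclaimed. Intended to be called periodically, e.g. from a progress loop. */
static size_t ack_pool_expire(struct ack_pool *pool, uint64_t now_ms)
{
    assert(pool != NULL);
    size_t reclaimed = 0;
    for (size_t i = 0; i < ACK_POOL_SIZE; i++) {
        if (pool->slots[i].in_use && now_ms >= pool->slots[i].deadline_ms) {
            pool->slots[i].in_use = false;
            reclaimed++;
        }
    }
    return reclaimed;
}
```

Without the `ack_pool_expire` step, every response whose ack never arrives permanently holds a slot, and `ack_pool_post` eventually returns -1 for all new requests. With it, abandoned waits are reclaimed after the timeout and the pool recovers.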


In the meantime, we are adding a timeout detection mechanism to crt_reply to prevent this from happening.