mercury-hpc / mercury

Mercury is a C library for implementing RPC, optimized for HPC.
http://www.mcs.anl.gov/projects/mercury/
BSD 3-Clause "New" or "Revised" License

HG: safe mechanism to deregister an RPC while handles for that RPC are still in use #534

Open carns opened 2 years ago

carns commented 2 years ago

Is your feature request related to a problem? Please describe.

Imagine a hypothetical scenario in which a service is periodically receiving a particular RPC type. The service then begins to shut down (without coordinating with clients) and deregisters that RPC as part of the shut down process.

In this case, the service could have already begun executing handlers for the RPC, and those handlers will continue to execute despite deregistration. Margo includes a workaround that seems to cover most cases: when it retrieves the registered data associated with a given RPC, it checks whether that data is NULL (see https://github.com/mochi-hpc/mochi-margo/pull/170).

Describe the solution you'd like

It may be cleaner if Mercury had a way to avoid impacting existing handles on a given RPC ID when deregistering. For example, it could deny new RPCs on that ID immediately, but use reference counting to defer full deregistration until all in-flight handles associated with the ID are closed. There are probably other solutions; that's just one option.
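To make the reference-counting idea concrete, here is a minimal, single-threaded sketch. It is not Mercury code: `struct rpc_entry` and every function name below are hypothetical, and a real implementation would need atomic counters and locking around the registry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical registry entry; not Mercury's internal layout. */
struct rpc_entry {
    void *data;               /* user data registered with the RPC */
    void (*free_cb)(void *);  /* invoked when the entry is finally released */
    unsigned int ref_count;   /* 1 for the registration + 1 per open handle */
    bool accepting;           /* false once deregistration was requested */
};

struct rpc_entry *rpc_entry_create(void *data, void (*free_cb)(void *))
{
    struct rpc_entry *e = malloc(sizeof(*e));
    if (e == NULL)
        return NULL;
    e->data = data;
    e->free_cb = free_cb;
    e->ref_count = 1; /* reference held by the registration itself */
    e->accepting = true;
    return e;
}

/* Drop one reference. Returns true if this was the last one and the
 * entry (and its registered data) was released. */
bool rpc_entry_release(struct rpc_entry *e)
{
    assert(e->ref_count > 0);
    if (--e->ref_count > 0)
        return false;
    if (e->free_cb != NULL)
        e->free_cb(e->data);
    free(e);
    return true;
}

/* A new RPC arrives: refused as soon as deregistration has begun. */
bool rpc_handle_create(struct rpc_entry *e)
{
    if (!e->accepting)
        return false;
    e->ref_count++;
    return true;
}

/* A handle is closed: may trigger the deferred release. */
bool rpc_handle_destroy(struct rpc_entry *e)
{
    return rpc_entry_release(e);
}

/* Deny new RPCs immediately; the data stays valid until the last
 * in-flight handle is destroyed. */
bool rpc_deregister(struct rpc_entry *e)
{
    assert(e->accepting);
    e->accepting = false;
    return rpc_entry_release(e);
}

/* Sample free callback for the demo: counts how often it is called. */
void count_free(void *data)
{
    ++*(int *)data;
}

/* Demo: deregister while one handle is still open.
 * Returns the number of times the free callback ran. */
int demo_deferred_deregistration(void)
{
    int freed = 0;
    struct rpc_entry *e = rpc_entry_create(&freed, count_free);
    if (e == NULL)
        return -1;
    bool ok = rpc_handle_create(e);       /* one in-flight handle */
    bool released = rpc_deregister(e);    /* deferred: handle still open */
    assert(ok && !released && freed == 0);
    assert(!rpc_handle_create(e));        /* new RPCs are denied */
    released = rpc_handle_destroy(e);     /* last reference: data freed */
    assert(released);
    return freed;
}
```

In this sketch the registration itself holds one reference, so deregistering with no handles open releases the data right away, while deregistering with handles open only flips `accepting` and leaves the release to the last `rpc_handle_destroy()`.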

Describe alternatives you've considered

So far it seems like in-flight RPCs aren't particularly harmed unless they rely on registered data associated with the RPC, but we are still testing.

shanedsnyder commented 2 years ago

The Margo fix that Phil mentions is only part of the solution, as it applies only to the Margo boilerplate logic that runs before user RPC handler code. It looks like service RPC handlers themselves also have to be careful not to assume they can retrieve data registered with the RPC. Adding safety checks there is not a huge deal, but it would be nice if Mercury could provide stricter guarantees about the lifetime of registered data for RPC handlers that are already executing.
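The kind of handler-side safety check described above might look roughly like the following sketch. It does not link against Mercury: `lookup_registered_data()` is a hypothetical stand-in for a lookup such as `HG_Registered_data()`, and `struct svc_state`, `registered_slot`, and the status codes are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-RPC state that a service registers with its RPC. */
struct svc_state {
    int requests_served;
};

/* Stand-in for the registry slot; NULL models "RPC already deregistered". */
struct svc_state *registered_slot = NULL;

/* Mock lookup; a real handler would query the RPC library instead. */
void *lookup_registered_data(void)
{
    return registered_slot;
}

enum handler_status { HANDLER_OK = 0, HANDLER_NO_DATA = -1 };

/* Handler body that does not assume the registered data still exists. */
enum handler_status example_handler(void)
{
    struct svc_state *state = lookup_registered_data();
    if (state == NULL) {
        /* The RPC was deregistered while this request was in flight:
         * return an error instead of dereferencing NULL. */
        return HANDLER_NO_DATA;
    }
    state->requests_served++;
    return HANDLER_OK;
}

/* Demo: one request before and one after deregistration.
 * Returns how many requests were actually served. */
int demo_in_flight_handler(void)
{
    struct svc_state s = { 0 };
    registered_slot = &s;
    enum handler_status before = example_handler();
    registered_slot = NULL; /* models deregistration during shutdown */
    enum handler_status after = example_handler();
    assert(before == HANDLER_OK && after == HANDLER_NO_DATA);
    return s.requests_served;
}
```

A check like this only helps if the lookup actually returns NULL after deregistration. It presumably cannot protect a handler that fetched the pointer before the data was released, which is where a lifetime guarantee from the library would come in.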

mdorier commented 6 months ago

Has this problem been solved in Mercury 2.3.0?

soumagne commented 6 months ago

No, this has not been implemented yet.