mochi-hpc / mochi-ssg

Scalable Service Groups (SSG), a group membership service for Mochi

Program hanging when finalization is requested by remote #14

Closed shanedsnyder closed 3 years ago

shanedsnyder commented 3 years ago

In GitLab by @mdorier on Dec 17, 2019, 13:16

I modified the ssg-launch-group.c test in a new branch here: https://xgitlab.cels.anl.gov/sds/ssg/commits/test-finalize-callback. In this branch, SSG finalization is done through a Margo finalization callback.

If you run it with a single process as follows:

mpirun -n 1 ./tests/ssg-launch-group -s 10 ofi+tcp mpi

The process shuts down correctly after 10 seconds.

However, if the shutdown is requested by another process using margo_shutdown_remote_instance, the call to ssg_group_destroy in the callback hangs.

I didn't include the shutdown program, but it's easy enough to write a small C program that takes the address of the process to shut down, initializes Margo, looks up the address, calls margo_shutdown_remote_instance, then finalizes.
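For reference, the shutdown program described above could look roughly like the following. This is a minimal sketch, not the program used in the original test: it assumes the protocol (e.g. ofi+tcp) and the target address are passed as two command-line arguments, and that the target process has enabled remote shutdown (margo_enable_remote_shutdown) on its side. It needs Margo installed to build.

```c
#include <stdio.h>
#include <stdlib.h>
#include <margo.h>

/* Hypothetical usage: ./remote-shutdown <protocol> <target-address> */
int main(int argc, char** argv)
{
    if (argc != 3) {
        fprintf(stderr, "Usage: %s <protocol> <target-address>\n", argv[0]);
        return EXIT_FAILURE;
    }
    const char* protocol = argv[1]; /* e.g. "ofi+tcp" */
    const char* target   = argv[2]; /* address printed by the target process */

    /* Client mode, no dedicated progress thread, no RPC handler threads. */
    margo_instance_id mid = margo_init(protocol, MARGO_CLIENT_MODE, 0, 0);
    if (mid == MARGO_INSTANCE_NULL) {
        fprintf(stderr, "margo_init failed\n");
        return EXIT_FAILURE;
    }

    int        ret  = EXIT_SUCCESS;
    hg_addr_t  addr = HG_ADDR_NULL;
    hg_return_t hret = margo_addr_lookup(mid, target, &addr);
    if (hret != HG_SUCCESS) {
        fprintf(stderr, "margo_addr_lookup failed for %s\n", target);
        ret = EXIT_FAILURE;
    } else {
        /* Ask the remote Margo instance to finalize itself. */
        if (margo_shutdown_remote_instance(mid, addr) != 0) {
            fprintf(stderr, "margo_shutdown_remote_instance failed\n");
            ret = EXIT_FAILURE;
        }
        margo_addr_free(mid, addr);
    }

    margo_finalize(mid);
    return ret;
}
```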

shanedsnyder commented 3 years ago

In GitLab by @mdorier on Dec 18, 2019, 04:19

I believe I found the issue: ssg_group_destroy calls ssg_group_destroy_internal, which calls swim_finalize, which ends up blocking in margo_thread_sleep. margo_thread_sleep requires the Mercury progress loop to be running, but Margo calls finalization callbacks after the progress loop has been terminated. Those callbacks therefore cannot make calls that require the loop to be running (e.g. margo_forward or margo_thread_sleep).

I think the fix belongs in Margo rather than SSG: we should have a set of margo_push_prefinalize_callback functions to push callbacks that are intended to run before the progress loop is terminated. Those callbacks could still issue RPCs or call margo_thread_sleep, but they would not guarantee that the process receives no RPCs in the meantime. Finalize callbacks, by contrast, guarantee that no more RPCs will be received, at the cost that no RPC or timer can be posted anymore.
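The difference between the two kinds of callbacks can be illustrated with a toy model. The sketch below does not use Margo at all: every name in it (toy_instance, toy_sleep, toy_finalize, and so on) is hypothetical, and running callbacks in reverse order of registration is an assumption of the sketch. In it, toy_sleep stands in for any call that needs the progress loop and returns -1 where the real call would hang.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CALLBACKS 8

typedef void (*toy_callback_t)(void* arg);

/* Fixed-size stack of callbacks with their user arguments. */
typedef struct {
    toy_callback_t fn[TOY_MAX_CALLBACKS];
    void*          arg[TOY_MAX_CALLBACKS];
    size_t         count;
} toy_callback_stack;

/* Toy stand-in for a Margo instance (hypothetical, not Margo's layout). */
typedef struct {
    bool               progress_running;
    toy_callback_stack prefinalize;
    toy_callback_stack finalize;
} toy_instance;

void toy_init(toy_instance* inst)
{
    inst->progress_running  = true;
    inst->prefinalize.count = 0;
    inst->finalize.count    = 0;
}

/* Returns 0 on success, -1 if the stack is full. */
int toy_push(toy_callback_stack* stack, toy_callback_t fn, void* arg)
{
    if (stack->count == TOY_MAX_CALLBACKS) return -1;
    stack->fn[stack->count]  = fn;
    stack->arg[stack->count] = arg;
    stack->count++;
    return 0;
}

/* Pops and runs every callback, most recently pushed first. */
void toy_run_all(toy_callback_stack* stack)
{
    while (stack->count > 0) {
        stack->count--;
        stack->fn[stack->count](stack->arg[stack->count]);
    }
}

/* Stand-in for a call that needs the progress loop:
 * 0 if the loop is running, -1 where the real call would hang. */
int toy_sleep(const toy_instance* inst)
{
    return inst->progress_running ? 0 : -1;
}

/* Prefinalize callbacks run while the loop is alive,
 * finalize callbacks run after it has been stopped. */
void toy_finalize(toy_instance* inst)
{
    toy_run_all(&inst->prefinalize);
    inst->progress_running = false;
    toy_run_all(&inst->finalize);
}

/* Probe callback that records whether toy_sleep was usable. */
typedef struct {
    toy_instance* inst;
    int           result;
} toy_probe;

void toy_probe_cb(void* arg)
{
    toy_probe* probe = (toy_probe*)arg;
    probe->result    = toy_sleep(probe->inst);
}
```

Pushing toy_probe_cb onto the prefinalize stack records 0 (the sleep would succeed), while pushing it onto the finalize stack records -1, which corresponds to the hang seen in ssg_group_destroy.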

I'll add those functions in Margo, retry the SSG test program, and close the issue.

shanedsnyder commented 3 years ago

In GitLab by @mdorier on Dec 18, 2019, 07:54

OK, the problem is fixed when using the new margo_push_prefinalize_callback feature I just added to Margo. I opened a PR to SSG (https://xgitlab.cels.anl.gov/sds/ssg/merge_requests/6) that adds a test of this feature (this PR doesn't include code for shutting down remotely, though).

I'm closing the issue.

shanedsnyder commented 3 years ago

In GitLab by @mdorier on Dec 18, 2019, 07:54

closed