mochi-hpc / mochi-ssg

Scalable Service Groups (SSG), a group membership service for Mochi

Problem with ordering of margo_init/margo_finalize and ssg_init/ssg_finalize #29

Open shanedsnyder opened 3 years ago

shanedsnyder commented 3 years ago

In GitLab by @mdorier on Jan 6, 2021, 12:55

A program that calls these functions in the following order will block on the ssg_finalize call:

This happens in particular when using a thallium engine, which wraps a margo instance and calls finalize in its destructor. The following order will lead to ssg_finalize being called before the destructor:

shanedsnyder commented 3 years ago

In GitLab by @shanedsnyder on Jan 6, 2021, 13:07

This is similar to the issues we were hitting with Python bindings.

The problem is mainly that SSG initializes Argobots if it has not been initialized already, and when SSG is the one that initialized Argobots, it also finalizes it at SSG shutdown time. That shuts down Argobots even though Margo is still using it, causing the hang.
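
The ownership rule described above can be illustrated with a toy model. This is only a sketch and does not use the real SSG, Margo, or Argobots APIs: `runtime_up`, `lib_state`, `lib_init`, and `lib_finalize` are hypothetical stand-ins, and the shared runtime is reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the shared runtime (Argobots in the discussion above).
 * Hypothetical: the real runtime is not a single flag. */
static bool runtime_up = false;

/* Hypothetical per-library state: did this library start the runtime? */
struct lib_state {
    bool owns_runtime;
};

/* Start the runtime only if it is not up yet, and remember that we did. */
void lib_init(struct lib_state *lib)
{
    lib->owns_runtime = !runtime_up;
    if (lib->owns_runtime)
        runtime_up = true;
}

/* Shut the runtime down if we started it, regardless of other users. */
void lib_finalize(struct lib_state *lib)
{
    if (lib->owns_runtime) {
        runtime_up = false;
        lib->owns_runtime = false;
    }
}

/* Replays the problematic order and reports whether the runtime is still
 * up for the second library once the first library has finalized. */
bool runtime_survives_first_finalize(void)
{
    struct lib_state ssg_like, margo_like;

    runtime_up = false;
    lib_init(&ssg_like);      /* first in: starts the runtime and owns it */
    lib_init(&margo_like);    /* runtime already up: does not own it */
    lib_finalize(&ssg_like);  /* owner tears the runtime down */

    bool still_up = runtime_up;
    lib_finalize(&margo_like);
    assert(!runtime_up);      /* nothing is left running at the end */
    return still_up;
}
```

In this model `runtime_survives_first_finalize()` returns `false`: the second library is left holding a runtime that the first one already shut down, which is the situation that leads to the hang.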

I think the issue is mostly that the Argobots environment isn't reference counted. It'd be nice if Margo and SSG could each init/finalize Argobots without interfering with each other. We should probably bring this up with the Argobots team to see if we can get this working.
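
A reference-counted init/finalize pair of the kind suggested above might look like the following. This is a minimal sketch under stated assumptions, not Argobots code: `rt_acquire`, `rt_release`, and `rt_is_up` are hypothetical names, the runtime is again modeled as a flag, and the counter is not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical shared-runtime state: a user count plus an "up" flag. */
static int  rt_refcount = 0;
static bool rt_up = false;

/* Each user calls this at init time; only the first call starts the runtime. */
void rt_acquire(void)
{
    if (rt_refcount++ == 0)
        rt_up = true;
}

/* Each user calls this at finalize time; only the last call stops the runtime. */
void rt_release(void)
{
    assert(rt_refcount > 0);  /* a release must match an earlier acquire */
    if (--rt_refcount == 0)
        rt_up = false;
}

/* Reports whether the runtime is currently running. */
bool rt_is_up(void)
{
    return rt_up;
}
```

With this scheme the order of the two finalize calls no longer matters: if an SSG-like user and a Margo-like user both acquire, the runtime stays up after the first release and stops only after the second.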

shanedsnyder commented 3 years ago

In GitLab by @mdorier on Jan 6, 2021, 13:14

Oh I see, yes I think that'd be nice.

shanedsnyder commented 3 years ago

In GitLab by @carns on Jan 6, 2021, 15:30

I'm probably behind the times, but is it normal to init SSG before Margo?

shanedsnyder commented 3 years ago

In GitLab by @shanedsnyder on Jan 6, 2021, 15:44

I added that ability so that you could use SSG ahead of margo_init, which was useful for getting SSG credential stuff (like for Cray DRC) working right. Basically you can init SSG, and load group files so that you can pull credential information out, and then pass those credentials to margo_init.