mochi-hpc / mochi-ssg

Scalable Service Groups (SSG), a group membership service for Mochi

weird behavior with SSG's use of margo_forward_timeout #26

Closed shanedsnyder closed 3 years ago

shanedsnyder commented 3 years ago

In GitLab by @shanedsnyder on Dec 16, 2020, 10:30

In some testing, we've seen evidence that there could be issues with margo_forward_timeout behavior in SSG. When testing at scale, the default timeout of 2 seconds used by SSG has not been sufficient, yet when we raise the default timeout and time the SSG RPCs, they appear to complete in under 2 seconds.

We should investigate whether there are bugs in the margo_forward_timed call. We should also consider whether to use the plain margo_forward within SSG, or to come up with more flexible timeout values.
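As a rough illustration of the two ideas above (timing an RPC against the timeout, and making the timeout configurable), here is a minimal, self-contained C sketch. It does not use Margo or SSG at all: `rpc_timeout_ms_from_env`, `timed_call`, `stub_rpc`, and the environment variable name in the usage note are hypothetical names invented for this example, not part of either library. In a real setting the callback passed to `timed_call` would presumably wrap the actual forward call.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* The 2-second default mentioned in this issue, in milliseconds. */
#define DEFAULT_RPC_TIMEOUT_MS 2000.0

/* Hypothetical helper: read a timeout (in ms) from an environment
 * variable, falling back to the default if it is unset or invalid. */
static double rpc_timeout_ms_from_env(const char *var_name)
{
    const char *val = getenv(var_name);
    if (val == NULL || *val == '\0')
        return DEFAULT_RPC_TIMEOUT_MS;

    char *end = NULL;
    double ms = strtod(val, &end);
    if (end == val || *end != '\0' || ms <= 0.0)
        return DEFAULT_RPC_TIMEOUT_MS;
    return ms;
}

/* Monotonic wall-clock time in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Hypothetical callback type standing in for "issue one RPC". */
typedef int (*rpc_call_fn)(void *arg);

/* Run one call, store its elapsed time in ms, and return its result. */
static int timed_call(rpc_call_fn fn, void *arg, double *elapsed_ms)
{
    double start = now_ms();
    int ret = fn(arg);
    *elapsed_ms = now_ms() - start;
    return ret;
}

/* Placeholder "RPC" that does nothing; a stand-in for a real forward. */
static int stub_rpc(void *arg)
{
    (void)arg;
    return 0;
}
```

A caller could obtain the limit with `rpc_timeout_ms_from_env("EXAMPLE_RPC_TIMEOUT_MS")`, run `timed_call(stub_rpc, NULL, &elapsed)`, and log any case where the call reported a timeout even though `elapsed` is well below the limit, which is the mismatch described above.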

shanedsnyder commented 3 years ago

In GitLab by @shanedsnyder on Mar 18, 2021, 14:15

We only observed this issue on Summit (POWER architecture), and it turns out the cause was a problem with Argobots mutexes. More details here:

https://lists.argobots.org/pipermail/discuss/2021-January/000094.html

In any case, this issue is resolved in Argobots (at the very least, on the master branch).

shanedsnyder commented 3 years ago

In GitLab by @shanedsnyder on Mar 18, 2021, 14:15

closed