partial syncScope livelock

mratsim commented 4 years ago

Tries to address #119.

From the previous state machine syncScope

And some traces of a stuck process:

What seems to happen is:

The root thread seems to be the only thread left running, with all the others backed off.
The root thread gets stuck in a recvElseSteal loop, which is the prologue of the SB_Steal state.

Analysis (barring other bugs)

Hypothesis: All other threads are sleeping because they have no other tasks
Hypothesis: The root thread didn't exit the state machine because it is waiting on a pending descendant task
Hypothesis: The root thread is in SB_Steal because all direct child tasks were processed (note that the code in popFirstIfChild assumes that all child tasks are at the beginning of the deque)
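
To make the ordering assumption concrete, here is a minimal toy model in Python. It is not Weave's Nim implementation: the `Task` type, `pop_first_if_child` and `run_direct_children` are hypothetical stand-ins that only mimic the idea of popping the front of the deque when it holds a direct child. It shows how a grandchild sitting in front of a direct child leaves both tasks in the deque.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    parent: Optional[str]  # name of the task that spawned this one


def pop_first_if_child(dq: deque, current: str) -> Optional[Task]:
    """Pop the front task only if it is a direct child of `current`."""
    if dq and dq[0].parent == current:
        return dq.popleft()
    return None


def run_direct_children(dq: deque, current: str) -> list:
    """Run direct children from the front until a non-child blocks the way."""
    done = []
    while (task := pop_first_if_child(dq, current)) is not None:
        done.append(task.name)
    return done


# Assumed scenario: a grandchild sits in front of the second direct child.
dq = deque([
    Task("child_a", parent="root"),
    Task("grandchild", parent="child_a"),
    Task("child_b", parent="root"),
])
done = run_direct_children(dq, "root")
leftover = [t.name for t in dq]
print(done, leftover)
```

In this model only `child_a` runs; `grandchild` and `child_b` stay in the deque, which matches the situation described in the hypotheses above.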

Conclusion and fix

At least one of the descendant tasks is stuck in the root thread. It is stuck either because it is not a direct child but a grandchild (or deeper descendant), or because the ordering assumptions are wrong and an unrelated task that could not be popped sits in front of the child.
The root thread didn't receive any steal request that would let it dispatch the stuck tasks: https://github.com/mratsim/weave/blob/943d04aeaceba2347c455962f98ef0a676018de1/weave/state_machines/sync_scope.nim#L123-L134. This can happen if all threads are idle.

2 solutions are possible:

Drain the whole task queue before switching to SB_Steal
Or, in SB_Steal, answer not only steal requests but also work-sharing requests from idle workers
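
The difference the second option makes can be sketched with a small Python toy model. This is an illustration under stated assumptions, not Weave's code: `steal_phase`, its parameters and the round limit are hypothetical, and the model assumes every other worker is asleep so that no steal request ever arrives.

```python
from collections import deque


def steal_phase(leftover, idle_workers, share_with_idle, max_rounds=10):
    """Toy model of the steal phase on the root thread.

    Returns (rounds_used, tasks_still_stuck). Every other worker is asleep,
    so no steal request ever arrives in this model.
    """
    dq = deque(leftover)
    incoming_steal_requests = []  # always empty: sleeping workers do not steal
    rounds = 0
    while dq and rounds < max_rounds:
        rounds += 1
        # Original behaviour: hand out tasks only in response to steal requests.
        while incoming_steal_requests and dq:
            incoming_steal_requests.pop()
            dq.popleft()
        # Second option (sketch): also push tasks to workers known to be idle.
        if share_with_idle:
            sleeping = list(idle_workers)
            while sleeping and dq:
                sleeping.pop()   # wake one idle worker...
                dq.popleft()     # ...and hand it a stuck task
    return rounds, len(dq)


stuck_tasks = ["grandchild", "child_b"]
workers = ["w1", "w2", "w3"]
only_steal = steal_phase(stuck_tasks, workers, share_with_idle=False)
with_sharing = steal_phase(stuck_tasks, workers, share_with_idle=True)
print(only_steal, with_sharing)
```

Without sharing, the loop spins until the artificial round limit with both tasks still queued, which is the livelock in miniature. With sharing, the stuck tasks are handed to idle workers in the first round.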

mratsim commented 4 years ago

We use solution 2.

No impact on overhead, as measured by fibonacci with lazy flowvars (to avoid measuring memory overhead): under 200ms.

And with normal Flowvar: under 400ms.

Load distribution seems to be the same.

What may have changed is that, on sync and syncScope, a worker in the steal phase now sends its non-direct-child tasks first to its children, which may be sleeping, instead of to its thief. If the task was short, we could have saved energy by sending it only to the thief. Conversely, the load distribution might be better, since the runtime gets the opportunity to wake up sleeping threads; otherwise, sleeping threads are only woken up on a successful theft, even though the current workers might have extra tasks. In other words, the change is greedier and so closer to asymptotically optimal.

mratsim commented 4 years ago

Unfortunately this is not fully fixed: https://travis-ci.com/github/mratsim/weave/jobs/323526172#L1854

mratsim commented 4 years ago

After trying to mix both solutions, we still have the bug (though it is now rarer). 2020-04-26_21-39