Closed mratsim closed 4 years ago
We use solution 2.
No impact on overhead as measured by fibonacci with lazy flowvars (to not measure memory overhead) under 200ms
And with normal Flowvar under 400ms
load distribution seems to be the same.
What may have changed is that on sync
and syncScope
in the steal phase the worker sends its non-direct child tasks first to its children which may be sleeping instead of its thief. If the task was short we could have saved energy by only sending to the thief.
Inversely, the load distribution might be better since we give the runtime the opportunity to wake up sleeping threads as otherwise sleeping threads are only woken up on a successful theft even though the current workers might have extra tasks. I.e. the change is more greedy and so more asymptotically optimal.
Unfortunately this is not fully fixed: https://travis-ci.com/github/mratsim/weave/jobs/323526172#L1854
After trying to mix both solutions we still have the bug (now rarer)
There is a more problematic root cause, merging for now as it still helps a lot.
Perf note when trying to use both
The FSM seems to have slightly higher overhead. 20ms slower on Fib(40) on both eager (387ms runtime) and lazy flowvars (204 ms runtime).
This is acceptable for now but hopefully we find the real bug.
Try to address #119
From the previous state machine
And some traces of a stuck process:
What seem to happen is:
recvElseSteal
which is the prologue ofSB_Steal
state.Analysis (barring other bugs)
SB_Steal
because all direct child tasks were processed (Note that the code assumes that all child tasks are at the beginning of the deque inpopFirstIfChild
)Conclusion and fix
2 solutions are possible:
SB_Steal
, don't only answer steal requests but also work sharing requests from idle workers