Closed mratsim closed 4 years ago
Random reproduction on my machine, with GDB trace
(gdb) bt
#0 0x0000558879f3c30b in findVictim__5OLJbJhLYuFgICzMiLhy5Q ()
#1 0x0000558879f3c708 in trySteal__9agQ9cSofpr3mmBNymbHAfPA ()
#2 0x0000558879f3d6c0 in recvElseSteal__gpjNZOOOBMUGVrsaD0EE1Q ()
#3 0x0000558879f3f7df in wait__OvJxRK5afaM0uqnaxQ4veA ()
#4 0x0000558879f5d61c in gemm_impl__hBgdOPbRKz85JKQLrSn0uw ()
#5 0x0000558879f6063d in gemm_strided_nestable__g3VhDY0FncuQ6N3TvHnwfg ()
#6 0x0000558879f60cbb in testVsReference__9b4XSi2lcCNz75qje9cE9aQQw ()
#7 0x0000558879f6100c in NimMainInner ()
#8 0x0000558879f61190 in NimMain ()
#9 0x0000558879f272bd in main ()
The state machine rework should prevent the root thread from being left alone with tasks created in its own queue that it cannot process: https://github.com/mratsim/weave/pull/128/commits/c31e45f72501506fe11cf2c99cae0984fdd8f3ce
See previous:
And the new one:
Returning to CheckTask ensures that a worker exhausts its queue and doesn't leave any task there. Previously it would only run the stolen task, which might spawn new non-awaited tasks or enqueue delayed tasks.
If there are stalls left, they are probably related to parallel for #130
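To make the difference concrete, here is a minimal Python sketch of the idea, not Weave's actual Nim implementation. Only the `CheckTask` state name comes from the discussion above; `State`, `run_worker`, `demo`, the `return_to_check_task` flag and the single-worker setup are hypothetical and chosen for illustration. A task is modelled as a `(name, children)` pair, and running it enqueues its children on the worker's own queue.

```python
from collections import deque
from enum import Enum, auto


class State(Enum):
    CHECK_TASK = auto()  # drain the worker's own queue
    STEAL = auto()       # try to take a task from elsewhere
    DONE = auto()        # nothing left to steal


def run_worker(local_queue, stealable, return_to_check_task):
    """Run one worker until no task can be stolen; return executed task names."""
    executed = []

    def run(task):
        name, children = task
        executed.append(name)
        # A running task may spawn new tasks into the worker's own queue.
        local_queue.extend(children)

    state = State.CHECK_TASK
    while state is not State.DONE:
        if state is State.CHECK_TASK:
            while local_queue:
                run(local_queue.pop())
            state = State.STEAL
        elif not stealable:
            state = State.DONE
        else:
            run(stealable.popleft())
            # Reworked behaviour: go back to CHECK_TASK so that tasks spawned
            # by the stolen task are processed. Old behaviour: keep stealing.
            state = State.CHECK_TASK if return_to_check_task else State.STEAL
    return executed


def demo(return_to_check_task):
    """One stolen task that spawns one child; report what ran and what is left."""
    local_queue = deque()
    stealable = deque([("stolen", [("child", [])])])
    executed = run_worker(local_queue, stealable, return_to_check_task)
    return executed, len(local_queue)
```

In this sketch, `demo(False)` runs only the stolen task and leaves its child stranded in the local queue, which is the kind of leftover work that can stall a runtime waiting for completion. `demo(True)` drains the queue before the worker tries to steal again.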
In PR https://github.com/mratsim/weave/pull/118, the Azure tests are passing but 6/8 of the Travis tests are failing due to "no output received in the past 10 min":
https://travis-ci.com/github/mratsim/weave/builds/162064695
Example