mratsim / weave

A state-of-the-art multithreading runtime: message-passing based, fast, scalable, ultra-low overhead

Livelock in syncScope #119

Closed mratsim closed 4 years ago

mratsim commented 4 years ago

In PR https://github.com/mratsim/weave/pull/118, the Azure tests pass, but 6 of the 8 Travis tests fail due to "no output received in the past 10 min".

https://travis-ci.com/github/mratsim/weave/builds/162064695

Example:

========================================================================================
Running [ c -d:danger ] benchmarks/matmul_gemm_blas/test_gemm_output.nim
========================================================================================
Test [2x2] * [2x2] -> [2x2]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x2] * [2x3] -> [2x3]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x2] * [2x9] -> [2x9]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x2] * [2x37] -> [2x37]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x2] * [2x129] -> [2x129]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x2] * [2x700] -> [2x700]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x3] * [3x2] -> [2x2]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x3] * [3x3] -> [2x3]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x3] * [3x9] -> [2x9]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x3] * [3x37] -> [2x37]
  Mean Relative Error of Weave (nestable) vs reference: 0.0
Test [2x3] * [3x129] -> [2x129]

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
The build has been terminated

mratsim commented 4 years ago

Reproduced at random on my machine, with a GDB backtrace:

(gdb) bt
#0  0x0000558879f3c30b in findVictim__5OLJbJhLYuFgICzMiLhy5Q ()
#1  0x0000558879f3c708 in trySteal__9agQ9cSofpr3mmBNymbHAfPA ()
#2  0x0000558879f3d6c0 in recvElseSteal__gpjNZOOOBMUGVrsaD0EE1Q ()
#3  0x0000558879f3f7df in wait__OvJxRK5afaM0uqnaxQ4veA ()
#4  0x0000558879f5d61c in gemm_impl__hBgdOPbRKz85JKQLrSn0uw ()
#5  0x0000558879f6063d in gemm_strided_nestable__g3VhDY0FncuQ6N3TvHnwfg ()
#6  0x0000558879f60cbb in testVsReference__9b4XSi2lcCNz75qje9cE9aQQw ()
#7  0x0000558879f6100c in NimMainInner ()
#8  0x0000558879f61190 in NimMain ()
#9  0x0000558879f272bd in main ()

mratsim commented 4 years ago

The state machine rework should prevent the root thread from being left alone with tasks in its own queue that it never processes: https://github.com/mratsim/weave/pull/128/commits/c31e45f72501506fe11cf2c99cae0984fdd8f3ce

Previous state machine: image

New state machine: image

Returning to CheckTask ensures that a worker exhausts its queue and doesn't leave any task there. Previously it would only run the stolen task, which might spawn new non-awaited tasks or enqueue delayed tasks, and those would then sit unprocessed in the worker's own queue.
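To illustrate the difference, here is a toy, single-threaded model of the wait loop. It is only a sketch and not Weave's actual code: `Worker`, `sync_scope`, `demo`, the "Steal" state name, and the `return_to_check_task` flag are all hypothetical names made up for this example. Only the CheckTask state name comes from the rework above. The model assumes a stolen task spawns a child into the thief's own queue, and that no other worker is around to steal it back.

```python
from collections import deque


class Worker:
    """Hypothetical stand-in for a worker; not Weave's data structure."""

    def __init__(self):
        self.queue = deque()  # local task queue
        self.pending = 0      # tasks in the scope that have not finished yet

    def spawn(self, task):
        self.pending += 1
        self.queue.append(task)

    def run(self, task):
        task(self)            # the task may spawn children into self.queue
        self.pending -= 1


def sync_scope(worker, victim, return_to_check_task, max_steps=100):
    """Return True if all pending tasks finish, False if the loop stalls."""
    state = "CheckTask"
    for _ in range(max_steps):
        if worker.pending == 0:
            return True
        if state == "CheckTask":
            if worker.queue:
                worker.run(worker.queue.pop())
            else:
                state = "Steal"
        else:  # "Steal"
            if victim:
                worker.run(victim.popleft())
                if return_to_check_task:
                    # Reworked behaviour: drain our own queue before stealing again.
                    state = "CheckTask"
            # Old behaviour: stay in "Steal" even if our own queue is non-empty.
    return False  # gave up after max_steps: models the livelock


def demo(return_to_check_task):
    worker, victim = Worker(), deque()

    def child(w):
        pass

    def parent(w):
        w.spawn(child)  # lands in the thief's own queue

    victim.append(parent)
    worker.pending += 1  # account for the parent task
    return sync_scope(worker, victim, return_to_check_task)
```

In this model, `demo(False)` keeps spinning in the steal state while `child` sits in the local queue, and `demo(True)` picks it up on the next CheckTask pass and terminates.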

If any stalls remain, they are probably related to parallel for #130