lhmouse / mcfgthread

Cornerstone of the MOST efficient std::thread on Windows for mingw-w64
https://gcc-mcf.lhmouse.com/
Other
269 stars 28 forks source link

Condition variable deadlock when running under wine >=5.0 #44

Closed cvtsi2sd closed 4 years ago

cvtsi2sd commented 4 years ago

Code doing a ping-pong between two threads deadlocks when compiled with a mcfgthread-based cross-toolchain (64 bit, gcc 8.3.0-based, with the provided MCF patches applied) and run under wine >= 5.0 (5.0 on Ubuntu 20.04 x64, 5.9 on Arch Linux x64, plus various tests with Proton 5.0 using esync, again on Arch Linux x64). No problems when running under wine 3.0.

Here's the sample code: https://gist.github.com/cvtsi2sd/b9c534e6f447f35c781fc20deb3989b9

This also works fine (terminating with

turn: 1
count[0]: 50001
count[1]: 50000
tot_count: 100001

) when:

When it deadlocks (= exiting only due to timeout), I get it locked up at random stages.

comelz@7de7fbb980bd:/bug$ time WINEPATH=/opt/mingw64-mcf/x86_64-w64-mingw32/bin/ wine a.exe
Wait expired in thread 1 after 5 seconds
Status dump:
turn: 0
count[0]: 25
count[1]: 25
tot_count: 50

abnormal program termination

real    0m10.322s
user    0m0.122s
sys     0m0.140s
comelz@7de7fbb980bd:/bug$ time WINEPATH=/opt/mingw64-mcf/x86_64-w64-mingw32/bin/ wine a.exe
Wait expired in thread 1 after 5 seconds
Status dump:
turn: 0
count[0]: 2
count[1]: 2
tot_count: 4

abnormal program termination

real    0m5.306s
user    0m0.093s
sys     0m0.122s
comelz@7de7fbb980bd:/bug$ time WINEPATH=/opt/mingw64-mcf/x86_64-w64-mingw32/bin/ wine a.exe
Wait expired in thread 0 after 5 seconds
Status dump:
turn: 1
count[0]: 2
count[1]: 1
tot_count: 3

abnormal program termination

real    0m5.321s
user    0m0.111s
sys     0m0.113s
comelz@7de7fbb980bd:/bug$ time WINEPATH=/opt/mingw64-mcf/x86_64-w64-mingw32/bin/ wine a.exe
Wait expired in thread 1 after 5 seconds
Status dump:
turn: 0
count[0]: 7
count[1]: 7
tot_count: 14

abnormal program termination

real    0m5.301s
user    0m0.107s
sys     0m0.102s

Interestingly, if I don't run it under time it seems to get further:

comelz@7de7fbb980bd:/bug$ WINEPATH=/opt/mingw64-mcf/x86_64-w64-mingw32/bin/ wine a.exe
Wait expired in thread 0 after 5 seconds
Status dump:
turn: 1
count[0]: 825
count[1]: 824
tot_count: 1649

abnormal program termination
comelz@7de7fbb980bd:/bug$ WINEPATH=/opt/mingw64-mcf/x86_64-w64-mingw32/bin/ wine a.exe
Wait expired in thread 0 after 5 seconds
Status dump:
turn: 1
count[0]: 120
count[1]: 119
tot_count: 239

abnormal program termination
comelz@7de7fbb980bd:/bug$ WINEPATH=/opt/mingw64-mcf/x86_64-w64-mingw32/bin/ wine a.exe
Wait expired in thread 0 after 5 seconds
Status dump:
turn: 1
count[0]: 39
count[1]: 38
tot_count: 77

abnormal program termination

Between wine 3.0 and wine 5.0 I noticed several changes in sync.c, some to supposedly fix #36, some reimplementing SRW stuff basing it on futexes; I think they may be related, but I'm wondering if it's their implementation that is broken (it may be, I saw several very recent bugfixes to that file regarding keyed events), or the change of underlying implementation actually exposed a bug in MCF.

lhmouse commented 4 years ago

It seems really a bug. I can reproduce this deadlock on my Windows 7 machine.

lhmouse commented 4 years ago

I have almost forgotten the details. Look at https://github.com/lhmouse/mcfgthread/wiki/Pseudo-Code#condition-variable: The two paths (spinning and non-spinning, following if(bSpinnable) {) of condition variables look so different and I didn't think it worth implementing (if not optimizing), as in normal usecases condition variables should be protected by mutexes. Once flags also don't have this spinning loop. I think now we should remove it.

After the removal your program no longer deadlocks.

lhmouse commented 4 years ago

I presume the deadlock was caused by the fact that the signaling thread might run so quickly that it got and cleared the signal by itself (before the busy waiting thread could receive it), as the condition was not met it started another wait on the same condvar, while the waiting thread lost this signal, hence the two threads were blocking on the same condvar.

This renders the old implementation incorrect. It was invalid to unlock the mutex before bumping the trapped thread counter. And, thanks for the report, should work as expected now.