keymanapp / keyman

Keyman cross platform input methods system running on Android, iOS, Linux, macOS, Windows and mobile and desktop web
https://keyman.com/
Other
390 stars 108 forks source link

bug(core): WASM tests hanging on ba-keyman-macos-2 #11794

Open mcdurdin opened 3 months ago

mcdurdin commented 3 months ago

Many Core WASM tests hanging on ba-keyman-macos-2, but not always. Same tests always running to completion successfully on ba-keyman-macos-1. Looks like we have different versions of node/emscripten on the two mac agents:

ba-keyman-macos-1: node 21.7.3, emscripten 3.1.57-git ba-keyman-macos-2: node 22.0.0, emscripten 3.1.59-git

Downloaded the built results for ba-keyman-macos-2 and tested locally, could not get it to fail.

Example build fails:

Running locally, meson test -j 1 seemed to always pass tests, every attempt I tried. meson test -j 2 or higher usually resulted in hanging tests.

Ran brew upgrade node (upgrades everything!) to 22.3.0, emscripten went to 3.1.61. This still sporadically fails in the same way.

mcdurdin commented 3 months ago

https://github.com/nodejs/node/issues/47748 seems related.

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000198b419ec libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000198b7f55c libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x0000000104f1e1d8 libuv.1.dylib`uv_cond_wait + 40
    frame #3: 0x0000000100cf914c node`node::TaskQueue<v8::Task>::BlockingDrain() + 48
    frame #4: 0x0000000100cf7d6c node`node::NodePlatform::DrainTasks(v8::Isolate*) + 44
    frame #5: 0x0000000100b999ac node`node::SpinEventLoopInternal(node::Environment*) + 284
    frame #6: 0x0000000100ccfdf0 node`node::NodeMainInstance::Run(node::ExitCode*, node::Environment*) + 308
    frame #7: 0x0000000100ccfad0 node`node::NodeMainInstance::Run() + 124
    frame #8: 0x0000000100c435c4 node`node::Start(int, char**) + 476
    frame #9: 0x00000001987f60e0 dyld`start + 2360
mcdurdin commented 3 months ago

For now, -j 1 to serialize tests sidesteps the issue.

mcdurdin commented 3 months ago

Remove -j 1 and run meson test -t 0 on ba-=keyman-macos-2 in order to retrace our steps and repro the hang (most runs). I attached lldb to the hanging node process to grab the trace as shown above. Could probably escalate this to nodejs, but we don't currently have a minimal repro; seems like a race when 2+ node processes run at once, but would potentially need to do significant work on macos to bisect and repro on other devices.