Closed by insertinterestingnamehere 2 weeks ago
Okay, I just spent some more time debugging this. Here's my best guess at what's going on: this appears to be some kind of horrible performance pathology specific to the MUSL CI config. The failures decrease if we dramatically increase the time limit for those builds. This also explains https://github.com/sandialabs/qthreads/issues/268, since the test there can fail if one of its tasks randomly takes dramatically longer than expected.
As to the root cause, my best guess is that topology detection gives deceptive results in CI (see https://discuss.circleci.com/t/environment-variable-set-to-the-number-of-available-cpus/32670/3 for example), so our schedulers end up mismatched with the underlying topology. In cases where we bind workers to cores, that will likely result in some queues not being emptied until their work is stolen. I'm not sure why this is so dramatically worse in the Alpine/MUSL setup, but it is.
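To illustrate the kind of mismatch I mean (a standalone sketch, not qthreads code): the "online processor" count and the affinity mask a CI container actually grants can disagree, and a scheduler that sizes itself from the former ends up with more workers/queues than the latter can service.

```c
/* cpu_count_check.c - compare two common ways of asking "how many CPUs?"
 * In container CI environments these can disagree: sysconf() often reports
 * the host's core count, while the affinity mask granted to the container
 * is much smaller. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long online = sysconf(_SC_NPROCESSORS_ONLN);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    int usable = -1;
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        usable = CPU_COUNT(&mask);
    }

    printf("sysconf(_SC_NPROCESSORS_ONLN): %ld\n", online);
    printf("sched_getaffinity CPU count:   %d\n", usable);
    return 0;
}
```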
I can't reproduce this locally, even with MUSL. I'd argue that this is more of a problem with the CI environment, but I'll likely be overhauling the thread pool and topology management options soon anyway, so there's little value in pursuing this right now, even if it could plausibly be worked around on our end.
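For the record, if someone does want to work around it on their end, something along these lines should sidestep topology detection entirely by pinning the shepherd/worker counts before initialization. This is a hypothetical, untested sketch; it assumes the usual QTHREAD_NUM_SHEPHERDS / QTHREAD_NUM_WORKERS_PER_SHEPHERD environment knobs, and the counts shown are placeholders.

```c
/* Hypothetical workaround sketch (untested in the failing CI config):
 * force the shepherd/worker counts via environment variables before
 * initializing qthreads, so the scheduler doesn't size itself from
 * whatever topology the container pretends to have. */
#include <qthread/qthread.h>
#include <stdlib.h>

int main(void) {
    /* Placeholder values; pick counts matching the CPUs the CI
     * container actually grants. */
    setenv("QTHREAD_NUM_SHEPHERDS", "2", 1);
    setenv("QTHREAD_NUM_WORKERS_PER_SHEPHERD", "1", 1);

    if (qthread_initialize() != 0) {
        return 1;
    }

    /* ... spawn and sync tasks as usual ... */

    qthread_finalize();
    return 0;
}
```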
Given that, I'm just going to disable the failing builds and call this resolved. The same goes for #268.
I noticed this recently in CI: the MUSL CI tests hang intermittently. It happens on both x86 and ARM, but only when a topology-aware scheduler is combined with an actual topology detection backend. Nemesis is fine, and Sherwood and Distrib are fine if topology detection is disabled.
Tests known to hang especially frequently:
aligned_writeFF_basic
task_spawn