sandialabs / qthreads

Lightweight locality-aware user-level threading runtime.
https://www.sandia.gov/qthreads/

Topology-aware Schedulers Hang Intermittently on MUSL #267

Closed insertinterestingnamehere closed 2 weeks ago

insertinterestingnamehere commented 2 months ago

I noticed this recently in CI. The MUSL CI tests have intermittent hangs. It happens on both x86 and ARM, but only when a topology-aware scheduler is paired with an actual topology detection system. Nemesis is fine, and Sherwood and Distrib are fine if topology detection is off.

Tests known to hang especially frequently: aligned_writeFF_basic, task_spawn

insertinterestingnamehere commented 2 weeks ago

Okay, I just spent some more time debugging this. Here's my best guess for what's going on. This appears to be some kind of horrible performance pathology that happens specifically in the CI MUSL config. The failures decrease if we dramatically increase the time limit for those builds. This also explains https://github.com/sandialabs/qthreads/issues/268, since that test can fail if one of its tasks randomly takes dramatically longer than expected.

As to the cause, my best guess is that topology detection gives deceptive results in CI (see https://discuss.circleci.com/t/environment-variable-set-to-the-number-of-available-cpus/32670/3 for example), so our schedulers end up mismatched with the underlying topology. In cases where we bind workers to cores, that will likely leave some queues unemptied until their work is stolen. I'm not sure why this is so dramatically worse in the Alpine/MUSL setup, but it is.
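For reference, here's a minimal sketch (plain C, not qthreads code) of the kind of mismatch I suspect: in a shared CI container the processor counts reported by sysconf can reflect the host machine, while the affinity mask actually granted to the process is much smaller. A runtime that sizes and binds its worker pool from the former can end up with queues that nothing is ever scheduled to drain promptly.

```c
/* Sketch only: compare the system-reported CPU counts with the affinity
 * mask the kernel actually grants this process. In a CI container these
 * can differ wildly, which is the suspected source of the mismatch. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long configured = sysconf(_SC_NPROCESSORS_CONF); /* CPUs the system knows about */
    long online     = sysconf(_SC_NPROCESSORS_ONLN); /* CPUs currently online */

    cpu_set_t mask;
    CPU_ZERO(&mask);
    long usable = -1;
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
        usable = CPU_COUNT(&mask); /* CPUs this process may actually run on */

    printf("configured=%ld online=%ld affinity=%ld\n", configured, online, usable);
    return 0;
}
```

The linked CircleCI thread describes exactly this shape of discrepancy: the count-style queries report the host's cores while the job is only allotted a couple of them.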

I can't reproduce this locally, even with MUSL. I'd argue this is more of a problem with the CI environment than with qthreads, and I'll likely be overhauling the thread pool and topology management options soon anyway, so there's little value in pursuing this right now, even if it's something that could plausibly be worked around on our end.

Given that, I'm just going to disable the failing builds and call this resolved. The same goes for #268.