Closed by lschuermann 2 weeks ago
We believe this to be due to the following piece of code: https://github.com/treadmill-tb/treadmill/blob/c0adea5d60bf5a6fb92bcab8a2ef75d48bee98c9/switchboard/switchboard/src/sched.rs#L65-L87
If the supervisor disconnects, then this will cause `.wait_for()` to return `Err`, triggering an infinite hot loop.
This is due to the selection of `.wait_for()`, as opposed to `.changed()`, since `.wait_for()` will do an 'early exit' if the current value matches the predicate, whereas `.changed()` would always wait for an unseen value.
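The general failure mode (a wait that returns `Err` immediately once the other end is gone, inside a loop that simply retries) can be reproduced in isolation. The following is only a minimal sketch and not the switchboard code: it substitutes the standard library's `std::sync::mpsc` channel for the watch channel used in `sched.rs`, and the function names `spin_on_closed` and `stop_on_closed` are hypothetical. The `max_polls` bound exists only so the demonstration terminates; the real loop has no such bound.

```rust
use std::sync::mpsc;

/// Buggy pattern: every error is treated as "try again".
/// Once the sender is dropped, `recv()` returns `Err` immediately on every
/// call, so the loop spins. `max_polls` bounds the demo (assumption: the
/// real loop is unbounded). Returns the number of failed polls.
fn spin_on_closed(rx: &mpsc::Receiver<u32>, max_polls: usize) -> usize {
    let mut failed = 0;
    for _ in 0..max_polls {
        match rx.recv() {
            Ok(_value) => { /* handle the value */ }
            Err(_) => failed += 1, // no blocking here: straight back to the top
        }
    }
    failed
}

/// Fixed pattern: a closed channel is terminal, so leave the loop.
/// Returns the number of polls performed, including the final failed one.
fn stop_on_closed(rx: &mpsc::Receiver<u32>) -> usize {
    let mut polls = 0;
    loop {
        polls += 1;
        match rx.recv() {
            Ok(_value) => { /* handle the value */ }
            Err(_) => break, // sender is gone: stop instead of retrying
        }
    }
    polls
}

fn main() {
    // One value is delivered, then the sender "disconnects".
    let (tx, rx) = mpsc::channel::<u32>();
    tx.send(1).unwrap();
    drop(tx);
    let failed = spin_on_closed(&rx, 1000);
    println!("retrying loop: {failed} of 1000 polls failed without blocking");

    // Same setup with two values; the fixed loop exits on the first error.
    let (tx, rx) = mpsc::channel::<u32>();
    tx.send(1).unwrap();
    tx.send(2).unwrap();
    drop(tx);
    let polls = stop_on_closed(&rx);
    println!("terminating loop: exited after {polls} polls");
}
```

In the switchboard the corresponding fix would presumably also need to re-subscribe to the new supervisor connection rather than just exit, which this sketch does not attempt to show.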
It is also a symptom of a more general incorrect behaviour where the job termination watchdog fails to pick up the new supervisor connection, e.g. after a reconnection.
This is being addressed, though no PR currently exists.
Fixed by #58.
After a while, the switchboard becomes unresponsive to new enqueue-job requests on individual supervisors (the requests are accepted but not handled), and starts spinning on one or more CPU cores. This seems to indicate that there is some infinite loop, but one that occasionally yields control back to the executor (so not just a `loop {}`). We should debug this using the console-subscriber support introduced in #46.