Debug Switchboard Hang with 100% CPU Spin

lschuermann commented 1 month ago

After a while, the switchboard becomes unresponsive to new enqueue-job requests on individual supervisors (the requests are accepted but never handled) and starts spinning on one or more CPU cores. This suggests an infinite loop somewhere, though one that occasionally yields control back to the executor (so not just a bare loop {}).

We should debug this using the console-subscriber support introduced in #46.

max-cura commented 3 weeks ago

We believe this to be due to the following piece of code: https://github.com/treadmill-tb/treadmill/blob/c0adea5d60bf5a6fb92bcab8a2ef75d48bee98c9/switchboard/switchboard/src/sched.rs#L65-L87

If the supervisor disconnects, .wait_for() returns Err, which triggers an infinite hot loop. This comes from using .wait_for() rather than .changed(): .wait_for() does an 'early exit' if the current value already matches the predicate, whereas .changed() always waits for an unseen value. It is also a symptom of a more general incorrect behaviour where the job termination watchdog fails to pick up the new supervisor connection, e.g. after a reconnection.
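
To illustrate the general failure shape (not the actual sched.rs code), here is a minimal std-only sketch. It swaps in std::sync::mpsc for the async watch channel, purely so it runs without extra crates; buggy_loop, fixed_loop and the max_iters cap are hypothetical names made up for this example. The point is that once the sending side is gone, every receive returns Err immediately, so a loop that treats Err as "try again" never blocks and never makes progress.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Buggy shape: a receive error is treated as "retry".
/// After the sender is dropped, `recv()` fails immediately on every call,
/// so this loop spins. `max_iters` is a safety cap so the demo terminates;
/// the return value counts iterations that did nothing but hit the error.
fn buggy_loop(rx: &Receiver<u32>, max_iters: usize) -> usize {
    let mut wasted = 0;
    for _ in 0..max_iters {
        match rx.recv() {
            Ok(_job) => { /* handle the job */ }
            Err(_) => {
                wasted += 1;
                continue; // hot loop: nothing here ever blocks again
            }
        }
    }
    wasted
}

/// Fixed shape: a closed channel ends the loop instead of retrying.
/// Returns the number of messages handled before the channel closed.
fn fixed_loop(rx: &Receiver<u32>) -> usize {
    let mut handled = 0;
    loop {
        match rx.recv() {
            Ok(_job) => handled += 1,
            Err(_) => break, // peer is gone: stop, or go re-acquire a new connection
        }
    }
    handled
}

fn main() {
    let (tx, rx) = channel::<u32>();
    tx.send(1).unwrap();
    drop(tx); // simulate the supervisor disconnecting

    let handled = fixed_loop(&rx);
    println!("fixed loop handled {handled} message(s), then exited");

    let wasted = buggy_loop(&rx, 1000);
    println!("buggy loop burned {wasted} of 1000 iterations on errors");
}
```

In an async task the same pattern would still hit an .await on each iteration, which would be consistent with the observation that the spin occasionally yields to the executor rather than behaving like a bare loop {}.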

This is being addressed, though no PR currently exists.

max-cura commented 2 weeks ago

Fixed by #58.

treadmill-tb / treadmill

Debug Switchboard Hang with 100% CPU Spin #49