Bug: scheduler's critically delayed event processing has positive feedback loop

sharnoff commented 8 months ago

Environment

Prod (occurred twice recently)

Steps to reproduce

Hard to reproduce locally - requires odd circumstances under a lot of load. AFAICT we've only seen it occur at startup — probably because that's when there's the most stuff going on.

The general idea of the triggering behavior is this:

1. For some reason, event processing in the scheduler gets delayed (whether that's because of weird behavior at startup, or some other failure mode)
2. Some VM pod start events are delayed enough that the VMs are deleted before the events are handled
3. While handling those events:
   1. We don't see the VM in the local VM store (because it no longer exists)
   2. Thinking the store is out of date, we Relist(), which can take ~2s
   3. After relisting, the VM is still not in the store, so we return an error
4. Because relisting takes so long, more events get delayed, so we handle more pod start events after their VMs were deleted, which causes even more delays

Note that this is also because we only handle a single event at a time, so waiting for 2s handling one event holds up the entire queue.

(Originally, that was because otherwise we'd have to be careful to avoid out-of-order start/stop events; there are ways around this, though.)

Other logs, links

Tasks

- [ ] #853
- [ ] #854
- [ ] #863

Skipping duplicate Relist()s is somewhat complex, but is provably a solution here. Processing events in parallel allows us to make the problem small enough that we never reach the critical threshold of a positive feedback loop.

sharnoff commented 7 months ago

Partial recurrence here, I think: https://neondb.slack.com/archives/C03F5SM1N02/p1710952126841459

(delayed event handling, but not critically so)

sharnoff commented 7 months ago

Assigning @Omrigan and removing myself to reflect that remaining work will be via #863 (and #865), rather than #853.

Omrigan commented 7 months ago

Done via #863
