Closed by sharnoff 7 months ago
Partial recurrence here, I think: https://neondb.slack.com/archives/C03F5SM1N02/p1710952126841459
(delayed event handling, but not critically so)
Assigning @Omrigan and removing myself to reflect that remaining work will be via #863 (and #865), rather than #853.
Done via #863
Environment
Prod (occurred twice recently)
Steps to reproduce
Hard to reproduce locally - requires odd circumstances under a lot of load. AFAICT we've only seen it occur at startup — probably because that's when there's the most stuff going on.
The general idea of the triggering behavior is this:
Relist(): this can take ~2s
Note that this is also because we only handle a single event at a time, so waiting for 2s handling one event holds up the entire queue.
(Originally, that's because otherwise we'd have to be careful to avoid out-of-order start/stop events. There are ways around this, though.)
Other logs, links
Tasks
Skipping duplicate Relist()s is somewhat complex, but is provably a solution here. Processing events in parallel allows us to make the problem small enough that we never reach the critical threshold of a positive feedback loop.