neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0
150 stars 21 forks source link

Bug: Scheduler metrics can show buffer when there isn't any #671

Open sharnoff opened 9 months ago

sharnoff commented 9 months ago

Environment

Prod (eu-central-1)

Steps to reproduce

Unknown

Expected result

The node metrics reported by the scheduler should always match its internal state.

Actual result

The scheduler showed an extended period of non-zero buffer CPU/memory, even though the state dump showed no nodes with non-zero buffer CPU or memory (gist; collected by ntk a scheduler-state).

Screenshot of graphs in grafana, showing an initial spike in buffer CPU and memory, followed by a flat non-zero line of the buffer amount from 23:03 to 23:57

The scheduler was restarted at about 2023-12-05 23:57, at which point the issue was resolved.


Note that this was happening in at the same time as some other node-level disruptions on the same kubernetes node that the scheduler was on (slack link for more), so this there may have been some related impact there.

Other logs, links

sharnoff commented 5 months ago

Occurred a handful of times again today, possibly related to deploying scheduler without updating autoscaler-agents: https://neondb.slack.com/archives/C03H1K0PGKH/p1710352158661509?thread_ts=1710340649.207419