Drop role sometimes fails in compute_ctl

neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.

https://neon.tech

Apache License 2.0


Drop role sometimes fails in compute_ctl #8345

Closed save-buffer closed 3 weeks ago

save-buffer commented 2 months ago

Steps to reproduce

Needs investigation, but somehow reassignment didn't work in this incident https://neondb.slack.com/archives/C07BB3NHUUX/p1720584945416209

Expected result

Compute startup should never fail because of a bad role drop.

Actual result

Sometimes it does

Environment

Logs, links

skyzh commented 1 month ago

I feel a quick fix is to have the compute startup process emit a warning in this case instead of erroring, so that users can at least start their compute.
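A minimal sketch of that warn-instead-of-fail idea, in Rust since compute_ctl is written in Rust. Everything here is hypothetical and is not the actual compute_ctl code: `DropRoleError`, `DropReport`, and `drop_roles_best_effort` are illustrative names, and the `drop_role` closure stands in for whatever actually runs the role drop against Postgres.

```rust
use std::fmt;

/// Hypothetical error describing one failed role drop (not a real compute_ctl type).
#[derive(Debug, Clone, PartialEq)]
pub struct DropRoleError {
    pub role: String,
    pub reason: String,
}

impl fmt::Display for DropRoleError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to drop role {}: {}", self.role, self.reason)
    }
}

/// Summary of a best-effort pass over the roles scheduled for deletion.
#[derive(Debug, Default)]
pub struct DropReport {
    pub dropped: Vec<String>,
    pub failed: Vec<DropRoleError>,
}

/// Tries to drop every role. A failure is logged as a warning and recorded
/// in the report instead of aborting, so startup can continue.
/// `drop_role` is an assumed callback that performs the actual drop.
pub fn drop_roles_best_effort<F>(roles: &[&str], mut drop_role: F) -> DropReport
where
    F: FnMut(&str) -> Result<(), DropRoleError>,
{
    let mut report = DropReport::default();
    for &role in roles {
        match drop_role(role) {
            Ok(()) => report.dropped.push(role.to_string()),
            Err(err) => {
                // Warn rather than return early: the compute still starts.
                eprintln!("WARN: {}; continuing compute startup", err);
                report.failed.push(err);
            }
        }
    }
    report
}
```

The caller gets the failures back in `DropReport::failed` rather than having them silently swallowed, so they could still be surfaced in metrics or reported elsewhere.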

save-buffer commented 1 month ago

Yes, in principle I agree, but since we have some synchronization between cplane and compute, ignoring these errors could leave us in an inconsistent state. We should probably just make the compute the source of truth, and have some reconciliation back into the cplane if something fails. Seems like a fairly large-scope project.
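To make the reconciliation idea concrete, here is a rough Rust sketch that treats the compute's role list as authoritative and computes what the cplane would need to correct. It is only an illustration under assumptions: `RoleDrift` and `diff_roles` are made-up names, and how either side actually stores or exchanges role lists is not described in this thread.

```rust
use std::collections::BTreeSet;

/// Hypothetical description of how the cplane's view differs from the compute's.
#[derive(Debug, PartialEq)]
pub struct RoleDrift {
    /// Roles that exist in the compute but that the cplane does not know about,
    /// e.g. a role whose drop failed after the cplane recorded it as deleted.
    pub unknown_to_cplane: Vec<String>,
    /// Roles the cplane believes exist but that are absent from the compute.
    pub missing_in_compute: Vec<String>,
}

/// Compares both views. With the compute as the source of truth, the cplane
/// would add `unknown_to_cplane` and forget `missing_in_compute`.
pub fn diff_roles(cplane: &BTreeSet<String>, compute: &BTreeSet<String>) -> RoleDrift {
    RoleDrift {
        unknown_to_cplane: compute.difference(cplane).cloned().collect(),
        missing_in_compute: cplane.difference(compute).cloned().collect(),
    }
}
```

An empty `RoleDrift` means the two sides agree. Anything else is the set of corrections a reconciliation step would have to push back to the cplane.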

save-buffer commented 1 month ago

There's also the risk that if it generates warnings instead of causing an incident, we'll be lazy and just never fix it. Or put more diplomatically, it won't be "high priority" and we'll constantly have other, higher-priority things to fix. So I'm not sure what the right call is.

ololobus commented 1 month ago

Without the context it's hard to tell, but this is likely a duplicate of https://github.com/neondatabase/cloud/issues/13582

save-buffer commented 1 month ago

Ah nice, yes seems like a duplicate

ololobus commented 3 weeks ago

Closing as a duplicate of https://github.com/neondatabase/cloud/issues/13582