neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

storcon: reduce "connection refused" period during upgrades (storcon deployments cause cplane operation failures (`connection refused\nrequest must not be retried`)) #8034

Open problame opened 3 weeks ago

problame commented 3 weeks ago

Context: https://neondb.slack.com/archives/C06K38EB05D/p1718209960490099?thread_ts=1718184799.253779&cid=C06K38EB05D

Problem

In prodlike cloudbench, we observed that after a storcon deployment, cplane can still get connection refused errors when it tries to talk to storcon, as late as 44s (!) after storcon logs that it is up again.

Analysis

@ololobus :

Networking in k8s may take some time to roll out, and storcon has only one pod. Same for LB / ingress to discover targets.

Impact

When a cplane client makes a POST request, it doesn't retry the request when it gets connection refused, because it doesn't assume idempotency.

Example cplane log message

{"level":"ERR","ts":"2024-06-11T21:06:56.135Z","logger":"publicapiv2","message":"incoming request finished with internal error","http_meth":"POST","http_path":"/api/v2/projects/broad-boat-65064583/branches","route":"CreateProjectBranch","request_id":"70326701-9c98-4a9a-8fcd-aea2770ec8ed","trace_id":"T9WXJzpEJmibet3VyxrRjh","project_id":"broad-boat-65064583","account_id":"3eeaaef0-50fa-4074-8ed7-0a20f097d9fb","ingress_duration_ms":7277,"status":500,"account_id":"3eeaaef0-50fa-4074-8ed7-0a20f097d9fb","error":"could not create project-branch: Get \"http://neon-storage-controller.neon-storage-controller.svc.cluster.local:50051/v1/tenant/50f4379c3e4849ff7025fa4c14dced53/timeline/b7fd1f7c95d710c166c93d1ed0871324\": dial tcp 172.20.8.98:50051: connect: connection refused\nrequest must not be retried"}

Related

jcsp commented 3 weeks ago

Is this distinct from https://github.com/neondatabase/neon/issues/7797 ?

problame commented 3 weeks ago

#7797 mentions 503, so storcon was running.

What we observed here was connection refused, i.e., not even able to establish a TCP connection.

A (very) narrow-minded solution to #7797 may not address the connection refused issue.

But yeah, in spirit this is a dupe of #7797
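
To make the distinction above concrete, here is a minimal Python sketch that tells "the server answered with an HTTP error such as 503" apart from "the TCP connection was refused". The function name `classify_failure` and its return strings are hypothetical, chosen only for illustration; this is not how cplane or storcon classify errors.

```python
import urllib.error
import urllib.request


def classify_failure(url: str, timeout: float = 2.0) -> str:
    """Return 'ok', 'http_<status>', 'connection_refused', or 'other_error'.

    Hypothetical helper for illustration only.
    """
    # Ignore proxy settings from the environment so we probe the target directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return "ok"
    except urllib.error.HTTPError as exc:
        # The TCP connection succeeded and the server replied (e.g. 503):
        # the process is running but not ready to serve.
        return f"http_{exc.code}"
    except urllib.error.URLError as exc:
        # No HTTP response at all. A refused connection means nothing was
        # accepting connections on that address and port.
        if isinstance(exc.reason, ConnectionRefusedError):
            return "connection_refused"
        return "other_error"
```

`HTTPError` is caught before `URLError` because it is a subclass of it; with the order reversed, a 503 would be reported as `other_error`.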

jcsp commented 3 weeks ago

POST is idempotent as long as it includes a timeline ID -- @Bodobolero, until we make the controller more seamlessly available during restarts (in Q3), can you make your client retry past this class of error?
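
A client-side retry of this kind could look roughly like the Python sketch below. It assumes the caller only wraps requests it knows to be idempotent (for example, a create call that carries its own timeline ID). The names `retry_on_connection_refused` and `is_connection_refused` and the default backoff parameters are hypothetical; the defaults are chosen only so that the total wait comfortably exceeds the 44s window reported above.

```python
import time


def is_connection_refused(exc: BaseException) -> bool:
    """True for a refused TCP connection, raw or wrapped by urllib's URLError."""
    reason = getattr(exc, "reason", None)
    return isinstance(exc, ConnectionRefusedError) or isinstance(
        reason, ConnectionRefusedError
    )


def retry_on_connection_refused(
    do_request,
    attempts: int = 10,
    base_delay: float = 1.0,
    max_delay: float = 10.0,
    sleep=time.sleep,
):
    """Call do_request(), retrying only when the connection was refused.

    Any other error, and the last refused attempt, is re-raised unchanged.
    Only wrap requests that are safe to repeat.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return do_request()
        except OSError as exc:  # ConnectionRefusedError and URLError are OSErrors
            if not is_connection_refused(exc) or attempt == attempts - 1:
                raise
            # Exponential backoff, capped at max_delay seconds.
            sleep(min(max_delay, base_delay * 2**attempt))
```

With the defaults, the delays are 1, 2, 4, 8 and then 10 seconds for each remaining retry, about 65 seconds in total before the final error is surfaced.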