neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

mgmt api: timeline_get: returns 409 on pageserver restart (`will not become active`) #8033

Open problame opened 4 months ago

problame commented 4 months ago

Context: this message and subsequent ones in the thread https://neondb.slack.com/archives/C06K38EB05D/p1718188056338629?thread_ts=1718184799.253779&cid=C06K38EB05D

Problem

During Pageserver shutdown, we returned 409 on a simple timeline get mgmt api request from cplane:

2024-06-11T21:23:48.858698Z  INFO request{method=GET path=/v1/tenant/47e027b9c2d93ce75b436e24fcb2a65c/timeline/2302f3fba952ba3476c5e2ece778c3cc request_id=f934edec-97c0-4425-a1db-bf4bd7fb3300}: Error processing HTTP request: Conflict: will not become active.  Current state: Stopping

Cplane interprets 409 as a permanent error and fails the operation, whereas we would have wanted it to retry in this case.
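To make the retry expectation concrete, here is a rough sketch of how a caller in cplane's position might classify status codes. This is not cplane's actual code: `Action`, `classify`, and `get_with_retry` are hypothetical names, and the rule "503 is retryable, other non-2xx codes are permanent" is an assumption inferred from the behavior described above.

```rust
/// What a caller should do after a management API response.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Done,
    Retry,
    Fail,
}

/// Assumed policy: 2xx is success, 503 is transient, and everything
/// else (including 409) is treated as a permanent failure.
pub fn classify(status: u16) -> Action {
    match status {
        200..=299 => Action::Done,
        503 => Action::Retry,
        _ => Action::Fail,
    }
}

/// Calls `send` up to `max_attempts` times and returns the first
/// non-retryable outcome. `send` stands in for the real HTTP request
/// and returns only a status code. A real client would also sleep
/// with backoff between attempts; that is omitted here.
/// With `max_attempts == 0`, `send` is never called and `Err(0)` is returned.
pub fn get_with_retry<F>(mut send: F, max_attempts: u32) -> Result<u16, u16>
where
    F: FnMut() -> u16,
{
    let mut last = 0;
    for _ in 0..max_attempts {
        last = send();
        match classify(last) {
            Action::Done => return Ok(last),
            Action::Retry => continue,
            Action::Fail => return Err(last),
        }
    }
    // Retries exhausted: report the last status seen.
    Err(last)
}
```

Under this policy, a 409 during a restart fails on the first attempt, while a 503 would be retried until the pageserver is back or the attempts run out.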

Analysis

In the case of a pageserver (PS) restart, we should be returning 503.

However the problem is that TenantState::Stopping is set in both temporary and permanent circumstances, for example

Impact

This bug was discovered during prodlike cloudbench branch creation phase.

It causes the prodlike cloudbench to skip the remaining phases (critically, the benchmarking phase).

This skews results and makes them not comparable run-by-run.

Further, it cost a good 20-30 min of engineering time to triage this stupid bug, a cost that will be paid each time we look at cloudbench results.

jcsp commented 4 months ago

- temporary: pageserver is restarting
- permanent: tenant is deleting
- semi-permanent: tenant is detaching (error would go away once it's attached again)

Let's just make all of these 503s.
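A minimal sketch of what that could look like, assuming a simplified stand-in for the tenant state. The `TenantState` enum below has only two variants and is not the pageserver's real type, and `timeline_get_status` is a hypothetical helper, not an existing function in the codebase.

```rust
/// Simplified stand-in for the pageserver's tenant state.
/// The real `TenantState` has more variants; only the two needed
/// for this illustration are modeled here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TenantState {
    Active,
    Stopping,
}

pub const OK: u16 = 200;
pub const SERVICE_UNAVAILABLE: u16 = 503;

/// Hypothetical mapping from tenant state to the HTTP status of a
/// timeline get request. `Stopping` maps to 503 no matter why the
/// tenant is stopping (restart, detach, or delete), so the handler
/// does not need to tell the three cases apart.
pub fn timeline_get_status(state: TenantState) -> u16 {
    match state {
        TenantState::Active => OK,
        TenantState::Stopping => SERVICE_UNAVAILABLE,
    }
}
```

The trade-off is that a caller will also retry in the permanent (deleting) case until it gives up, in exchange for not having to distinguish the reasons behind `Stopping`.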