Problem
During Pageserver shutdown, the Pageserver returned 409 to a simple timeline GET management API request from cplane:
2024-06-11T21:23:48.858698Z INFO request{method=GET path=/v1/tenant/47e027b9c2d93ce75b436e24fcb2a65c/timeline/2302f3fba952ba3476c5e2ece778c3cc request_id=f934edec-97c0-4425-a1db-bf4bd7fb3300}: Error processing HTTP request: Conflict: will not become active. Current state: Stopping
Cplane interprets 409 as a permanent error and fails the operation, whereas we want it to retry in this case.
Analysis
In this case (a Pageserver restart), we should return 503.
The problem, however, is that TenantState::Stopping is set in both temporary and permanent circumstances, for example:
temporary: pageserver is restarting
permanent: tenant is deleting
semi-permanent: tenant is detaching (error would go away once it's attached again)
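One way to make the distinction above explicit is to carry the reason for stopping alongside the state and derive the HTTP status from it. The following is a minimal sketch, not the Pageserver's actual code: `StopReason`, `status_for_stopping`, and `is_retryable` are hypothetical names, and returning 409 for the detaching case is an assumption made for illustration only.

```rust
/// Hypothetical: why a tenant entered the stopping state.
/// The issue only describes a single `TenantState::Stopping`;
/// this enum is an assumption made for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopReason {
    /// Temporary: the pageserver is restarting.
    PageserverRestart,
    /// Permanent: the tenant is being deleted.
    TenantDeleting,
    /// Semi-permanent: the tenant is detaching; the error would go away
    /// once it is attached again.
    TenantDetaching,
}

/// Map the stop reason to the HTTP status code the management API
/// could return while the tenant is stopping.
pub fn status_for_stopping(reason: StopReason) -> u16 {
    match reason {
        // 503 Service Unavailable: the caller should retry.
        StopReason::PageserverRestart => 503,
        // 409 Conflict: the caller treats this as permanent.
        StopReason::TenantDeleting => 409,
        // Assumption: treated as non-retryable here; the right choice
        // for detaching is an open question in this issue.
        StopReason::TenantDetaching => 409,
    }
}

/// Hypothetical client-side view: retry on 503, fail on 409.
pub fn is_retryable(status: u16) -> bool {
    status == 503
}
```

With a mapping like this, a request that arrives during a restart would get a 503 and could be retried by cplane, while a request against a tenant being deleted would still fail fast with 409.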
Impact
This bug was discovered during the branch creation phase of prodlike cloudbench.
It causes prodlike cloudbench to skip the remaining phases (critically, the benchmarking phase).
This skews the results and makes them incomparable from run to run.
Further, it cost a good 20-30 minutes of engineering time to triage this stupid bug, a cost that will be paid each time we look at cloudbench results.
Context: this message and subsequent ones in the thread https://neondb.slack.com/archives/C06K38EB05D/p1718188056338629?thread_ts=1718184799.253779&cid=C06K38EB05D