Closed skyzh closed 1 month ago
The compaction circuit breaker thing was: https://github.com/neondatabase/neon/pull/8675
The tenant not active stuff may be a bug outside the pageserver, if something is calling in to the wrong place...
TODO: follow back the slack thread to the logs & figure out what was going on: were we trying to upload heatmap from something not in the active state?
Internal analysis notes: https://www.notion.so/neondatabase/Investigate-https-github-com-neondatabase-neon-issues-8678-775320d313d448459fa4a0791a71fbb3?pvs=4
public tl;dr:
tenant_manager.upsert_location
call here https://github.com/neondatabase/neon/blob/31122adee3fa0dac6bec5bfd1682f04adf42e490/pageserver/src/http/routes.rs#L1405-L1408
attach
)secondary_controller.upload_tenant
call here: https://github.com/neondatabase/neon/blob/31122adee3fa0dac6bec5bfd1682f04adf42e490/pageserver/src/http/routes.rs#L1412-L1416
Active
yet ⇒ bails out with the error we see in the logsIn terms of fixing this, I see two options:
if let Some(timeout) = flush
block, use tenant.wait_to_become_active(timeout)
upsert_location
call returns, use tenant.wait_to_become_active(ACTIVE_TENANT_TIMEOUT)
The latter is what the other user of upsert_location
--- do_shard_split
--- does.
Fundamentally, there's no way to easily eliminate this bug class entirely, as with both alternatives, there remains the possibility of hitting a timeout. But we can make more specific noise about the timeout being hit.
OpenAPI docs say:
The OpenAPI docs don't make any statements about the activation state of the tenant after /location_config
returns with 200
.
NB the "requested state" mentioned here
refers to Attached
vs Secondary
, not the Attached(Attaching)
/ Attached(Active)
sub-states.
So, I think the solution is to
inside the if let Some(timeout) = flush block, use tenant.wait_to_become_active(timeout)
Bigger picture though, I think something else went wrong here.
We received requests to transition into AttachedStale
with flush_ms=Some(...)
, but with increasing generation number. That combination makes us not take the fast path inside upsert_location
.
Going to extend the internal notes doc and post an update here.
More investigation notes: https://www.notion.so/neondatabase/Investigate-https-github-com-neondatabase-neon-issues-8678-775320d313d448459fa4a0791a71fbb3?pvs=4#a5ce5102a1264f1bb1110c55a485e8f5
public tl;dr
location_config
call fails the heatmap upload, due to the tenant not being active yet, as explained earlierif warmUp == true
⇒ HeatmapUpload
call to ps-13
I don’t understand why the repeat AttachedStale calls have incremented gen .
I don’t understand why the repeat AttachedStale calls have incremented gen .
This part is a bug, I think. We had a case last week (INC-275) where during a long running migration to controller, the control plane kept bumping generations.
WDYT, should we close this ticket without action?
Or still implement the proposed fix
https://github.com/neondatabase/neon/issues/8678#issuecomment-2296141388
I lean toward not bothering to support this gracefully in the pageserver: if someone is calling into location_conf with AttachedStale and bumping generation, then something weird is already going on. If we were going to work around this I would at most skip the heatmap upload if the tenant isn't active, but that would be a little invasive in put_tenant_location_config_handler
, and the design intent of this heatmap upload is that it's allowed to fail, so I'm okay with just leaving this as-is.
Closing as duplicate of https://github.com/neondatabase/cloud/issues/16672
https://neondb.slack.com/archives/C033RQ5SPDH/p1723205212988669