pageserver: something triggered errors for tenant is not active while uploading heatmap

neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.

https://neon.tech

Apache License 2.0

14.4k stars 418 forks source link

pageserver: something triggered errors for tenant is not active while uploading heatmap #8678

Closed skyzh closed 1 month ago

skyzh commented 1 month ago

https://neondb.slack.com/archives/C033RQ5SPDH/p1723205212988669

jcsp commented 1 month ago

The compaction circuit breaker thing was: https://github.com/neondatabase/neon/pull/8675

The tenant not active stuff may be a bug outside the pageserver, if something is calling in to the wrong place...

jcsp commented 1 month ago

TODO: follow back the slack thread to the logs & figure out what was going on: were we trying to upload heatmap from something not in the active state?

problame commented 1 month ago

Internal analysis notes: https://www.notion.so/neondatabase/Investigate-https-github-com-neondatabase-neon-issues-8678-775320d313d448459fa4a0791a71fbb3?pvs=4

Result: https://www.notion.so/neondatabase/Investigate-https-github-com-neondatabase-neon-issues-8678-775320d313d448459fa4a0791a71fbb3?pvs=4#3a8945e67741403083ef523b5ce6c5d6

public tl;dr:

Call comes in to transition into AttachedStale
- tenant_manager.upsert_location call here https://github.com/neondatabase/neon/blob/31122adee3fa0dac6bec5bfd1682f04adf42e490/pageserver/src/http/routes.rs#L1405-L1408
  - takes the slow path because generation numbers are different
  - ⇒ shuts down tenant and spawns a new one in the higher generation (span name: attach )
- secondary_controller.upload_tenant call here: https://github.com/neondatabase/neon/blob/31122adee3fa0dac6bec5bfd1682f04adf42e490/pageserver/src/http/routes.rs#L1412-L1416
  - dispatches to here: https://github.com/neondatabase/neon/blob/c39d5b03e81683717bd95c87615d68b0b23e887d/pageserver/src/tenant/secondary/heatmap_uploader.rs#L283-L295
  - tenant isn’t Active yet ⇒ bails out with the error we see in the logs

problame commented 1 month ago

In terms of fixing this, I see two options:

inside the if let Some(timeout) = flush block, use tenant.wait_to_become_active(timeout)
right after upsert_location call returns, use tenant.wait_to_become_active(ACTIVE_TENANT_TIMEOUT)

The latter is what the other user of upsert_location --- do_shard_split --- does.

Fundamentally, there's no way to easily eliminate this bug class entirely, as with both alternatives, there remains the possibility of hitting a timeout. But we can make more specific noise about the timeout being hit.

problame commented 1 month ago

OpenAPI docs say:

https://github.com/neondatabase/neon/blob/f17fe75169029dc4f2cd778d02d3aac57aae2341/pageserver/src/http/openapi_spec.yml#L344-L346

The OpenAPI docs don't make any statements about the activation state of the tenant after /location_config returns with 200.

NB the "requested state" mentioned here

https://github.com/neondatabase/neon/blob/f17fe75169029dc4f2cd778d02d3aac57aae2341/pageserver/src/http/openapi_spec.yml#L365-L371

refers to Attached vs Secondary, not the Attached(Attaching) / Attached(Active) sub-states.

problame commented 1 month ago

So, I think the solution is to

inside the if let Some(timeout) = flush block, use tenant.wait_to_become_active(timeout)

problame commented 1 month ago

Bigger picture though, I think something else went wrong here.

We received requests to transition into AttachedStale with flush_ms=Some(...), but with increasing generation number. That combination makes us not take the fast path inside upsert_location.

Going to extend the internal notes doc and post an update here.

problame commented 1 month ago

More investigation notes: https://www.notion.so/neondatabase/Investigate-https-github-com-neondatabase-neon-issues-8678-775320d313d448459fa4a0791a71fbb3?pvs=4#a5ce5102a1264f1bb1110c55a485e8f5

public tl;dr

Cplane migration-into-storcon flow marks tenant on cplane-managed ps-13 as AttachedStale via location_config API
- That location_config call fails the heatmap upload, due to the tenant not being active yet, as explained earlier
- But location_config still returns 200
So, cplane flow proceeds with if warmUp == true ⇒ HeatmapUpload call to ps-13
- That call fails with 500 because tenant is still not active
So, cplane operation fails and is retried later

I don’t understand why the repeat AttachedStale calls have incremented gen .

jcsp commented 1 month ago

I don’t understand why the repeat AttachedStale calls have incremented gen .

This part is a bug, I think. We had a case last week (INC-275) where during a long running migration to controller, the control plane kept bumping generations.

problame commented 1 month ago

WDYT, should we close this ticket without action?

Or still implement the proposed fix

https://github.com/neondatabase/neon/issues/8678#issuecomment-2296141388

jcsp commented 1 month ago

I lean toward not bothering to support this gracefully in the pageserver: if someone is calling into location_conf with AttachedStale and bumping generation, then something weird is already going on. If we were going to work around this I would at most skip the heatmap upload if the tenant isn't active, but that would be a little invasive in put_tenant_location_config_handler, and the design intent of this heatmap upload is that it's allowed to fail, so I'm okay with just leaving this as-is.

problame commented 1 month ago

Closing as duplicate of https://github.com/neondatabase/cloud/issues/16672