neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Epic: timeline detach ancestor #6994

Closed koivunej closed 1 month ago

koivunej commented 7 months ago

Implement the timeline_detach API endpoint.

Motivation

#6888

DoD

Implementation ideas

### Tasks
- [ ] https://github.com/neondatabase/neon/pull/7228
- [ ] https://github.com/neondatabase/neon/pull/7285
- [ ] https://github.com/neondatabase/neon/pull/7422
- [ ] https://github.com/neondatabase/neon/pull/7456
- [ ] https://github.com/neondatabase/neon/pull/7639
- [ ] https://github.com/neondatabase/neon/pull/7706
- [ ] https://github.com/neondatabase/neon/pull/7779
- [ ] https://github.com/neondatabase/neon/pull/7813
- [ ] https://github.com/neondatabase/neon/pull/7650
- [ ] https://github.com/neondatabase/neon/pull/7833
- [ ] #6888
- [ ] https://github.com/neondatabase/neon/issues/7830
- [ ] https://github.com/neondatabase/neon/pull/8229
- [ ] drop tenant before entering restart: https://github.com/neondatabase/neon/pull/8354#discussion_r1677687592
- [ ] https://github.com/neondatabase/neon/pull/8332
- [ ] https://github.com/neondatabase/neon/pull/8353
- [ ] https://github.com/neondatabase/neon/pull/8354
- [ ] https://github.com/neondatabase/neon/pull/8430
- [x] failpoint testing between the 3 different completion points
- [x] storcon: should just pass through the ApiError from pageserver
- [ ] --- Cutline for considering project functionally complete ---
- [ ] Persist the ancestry of timelines after detach, to enable future WAL recovery; fixme comment with full example added in https://github.com/neondatabase/neon/pull/8354#discussion_r1677681420
- [x] should L1s be written with the actual highest key? Probably not, because the highest key is only known after the rewrite completes. Resolution: they should not be.
- [ ] pageserver config: default concurrency items from options
- [ ] unify reset_tenant and complete detaching
- [ ] there seems to always be one rewritten layer, but test asserts a case where that does not happen; investigate, is the straddling check wrong or test assertion does not work?
- [ ] TODO: discuss fate of WAL recovery for these tenants

Other related tasks and Epics

koivunej commented 7 months ago

Next steps:

koivunej commented 7 months ago

Last week I did not get to start this, however, not everyone had a chance to look at the PR either. We did have one call about the RFC, summarized here: https://github.com/neondatabase/neon/pull/6888#issuecomment-1981093123

This week:

koivunej commented 7 months ago

This week:

koivunej commented 6 months ago

Starting implementation this week even if RFC discussion still is waiting for alternate proposals.

koivunej commented 6 months ago

Per discussion:

koivunej commented 6 months ago

Update for the week:

After discussing with John:

So, reworking the implementation.

koivunej commented 6 months ago

Reworking the implementation has been slow, but some:

The maintenance mode requirement can be relaxed: maintenance mode is only required for page_service connections, because we can "rewrite" the ancestor+ancestor_lsn in RemoteTimelineClient. If we accept that the API endpoint leaves the in-memory state of the Timelines out of sync, then the control plane can simply reset the tenant after the operation.
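From the control plane's side, the relaxed flow could look roughly like this. This is a sketch under assumptions: `PageserverClient`-style method names (`detach_ancestor`, `reset_tenant`) are hypothetical wrappers, not the real API surface.

```python
def detach_and_reset(ps, tenant_id: str, timeline_id: str):
    """Run detach-ancestor without tenant-wide maintenance mode, then
    reset the tenant so the stale in-memory Timeline state is reloaded.

    `ps` is a hypothetical pageserver HTTP client wrapper.
    """
    # Rewrites ancestor/ancestor_lsn in remote metadata; the in-memory
    # Timeline objects are left out of sync on purpose.
    ps.detach_ancestor(tenant_id, timeline_id)
    # Reloading the tenant brings the in-memory state back in line.
    ps.reset_tenant(tenant_id)


class _RecordingClient:
    """Test stub that records the calls made against it."""
    def __init__(self):
        self.calls = []

    def detach_ancestor(self, tenant_id, timeline_id):
        self.calls.append(("detach_ancestor", tenant_id, timeline_id))

    def reset_tenant(self, tenant_id):
        self.calls.append(("reset_tenant", tenant_id))


rec = _RecordingClient()
detach_and_reset(rec, "tenant-1", "timeline-1")
```

The point of the sketch is the ordering: the reset must come after the detach, since it is what reconciles the deliberately stale in-memory state.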

Another detach ancestor case was discovered: if the "new main" gc_cutoff has progressed beyond the ancestor_lsn, then there is no real connection between the two timelines. This also means we must not reparent any timelines, only do the metadata change. So far I've been thinking of this case as "diverged".
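The "diverged" check above can be sketched as follows. This is illustrative Python pseudologic, not the actual Rust pageserver code; LSNs are modeled as plain integers and all names are made up:

```python
def plan_detach(new_main_gc_cutoff: int, ancestor_lsn: int) -> str:
    """Decide how a detach-ancestor operation should proceed.

    If the "new main" timeline's gc_cutoff has progressed beyond the
    ancestor_lsn (the branch point), there is no real connection left
    between the two timelines: only the metadata change is done and no
    sibling timelines are reparented.
    """
    if new_main_gc_cutoff > ancestor_lsn:
        return "diverged"  # metadata change only, no reparenting
    return "normal"        # copy prefix layers, reparent eligible timelines
```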

The uncertainty the "diverged" case introduces into which timelines end up reparented means that the control plane will always need to query the ancestry relationships after invoking the operation.
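A caller-side refresh might look like this. Hedged sketch: `list_timelines` and the response fields are assumptions modeled loosely on pageserver timeline listings, not a verified API:

```python
class _StubClient:
    """Hypothetical stand-in for a pageserver HTTP client, returning a
    canned timeline listing for illustration."""
    def list_timelines(self, tenant_id):
        return [
            {"timeline_id": "main"},
            {"timeline_id": "child", "ancestor_timeline_id": "main"},
        ]


def refresh_ancestry(client, tenant_id: str) -> dict:
    """Re-read every timeline's ancestor after a detach-ancestor call,
    since the "diverged" case makes the set of reparented timelines
    unpredictable to the caller.

    Returns {timeline_id: ancestor_timeline_id or None}.
    """
    return {
        t["timeline_id"]: t.get("ancestor_timeline_id")
        for t in client.list_timelines(tenant_id)
    }


ancestry = refresh_ancestry(_StubClient(), "tenant-1")
```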

koivunej commented 5 months ago

Plan for the week:

Compaction problem:

Right now, if the ancestor receives writes while the operation is ongoing, they might trigger a compaction. If there is a shutdown, then on retry we could end up with a union of the pre- and post-compaction layers, which might be a bad thing. An easy solution is to disallow copying the LSN prefix of L0 layers, but that only works with the current legacy compaction algorithm and does not scale to the next one.
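A toy model of the retry hazard, heavily simplified and with made-up layer names: a naive retry that copies whatever layers the ancestor currently has can end up holding both the original L0s and the L1 that compaction produced from them.

```python
def detach_copy(ancestor_layers, already_copied):
    """Naive retry behavior: copy every layer the ancestor currently
    has, keeping whatever an earlier, interrupted attempt copied."""
    return set(already_copied) | set(ancestor_layers)


# First attempt copies two L0 layers, then the pageserver shuts down.
first = detach_copy({"L0_a", "L0_b"}, set())

# Meanwhile the ancestor compacts: L0_a and L0_b are replaced by L1_ab.
second = detach_copy({"L1_ab"}, first)

# The retry now holds the union of pre- and post-compaction layers,
# i.e. overlapping copies of the same key/LSN range.
```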

koivunej commented 5 months ago

Plan for the week:

koivunej commented 5 months ago

Plan for the week:

koivunej commented 5 months ago

Plan for the week:

  1. finish up the openapi descriptions, adapt the code to work like described, document delta
  2. detach two ancestors for one specific prod tenant after deploy
  3. complete persistent gc+compaction blocking
    - pageserver crash (failpoint) tests
koivunej commented 5 months ago

Last week, due to other work, only "detach two ancestors for one specific prod tenant after deploy" was done. A minor follow-up from that was completed:

Not started:

What remains:

New for this week:

koivunej commented 4 months ago

Last week got trampled by testing and troubleshooting.

This week:

koivunej commented 4 months ago

Similarly with the previous week.

This week:

koivunej commented 4 months ago

Plan remains the same for the next few weeks when I will be on vacation.

Before that, production usage of timeline ancestor detach requires retrying with the reset_tenant endpoint in case #7830 happens. Using it for sharded tenants is also not advisable, because the endpoint is not yet idempotent, but it could be done manually.

koivunej commented 3 months ago

This week the plan remains the same, but #7830 looks like an easy first issue, testing aside.

koivunej commented 2 months ago

This week: