neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Epic: timeline detach ancestor #6994

Closed koivunej closed 1 month ago

koivunej commented 7 months ago

Implement the timeline_detach API endpoint.

Motivation

#6888

DoD

Implementation ideas

### Tasks
- [ ] https://github.com/neondatabase/neon/pull/7228
- [ ] https://github.com/neondatabase/neon/pull/7285
- [ ] https://github.com/neondatabase/neon/pull/7422
- [ ] https://github.com/neondatabase/neon/pull/7456
- [ ] https://github.com/neondatabase/neon/pull/7639
- [ ] https://github.com/neondatabase/neon/pull/7706
- [ ] https://github.com/neondatabase/neon/pull/7779
- [ ] https://github.com/neondatabase/neon/pull/7813
- [ ] https://github.com/neondatabase/neon/pull/7650
- [ ] https://github.com/neondatabase/neon/pull/7833
- [ ] #6888
- [ ] https://github.com/neondatabase/neon/issues/7830
- [ ] https://github.com/neondatabase/neon/pull/8229
- [ ] drop tenant before entering restart: https://github.com/neondatabase/neon/pull/8354#discussion_r1677687592
- [ ] https://github.com/neondatabase/neon/pull/8332
- [ ] https://github.com/neondatabase/neon/pull/8353
- [ ] https://github.com/neondatabase/neon/pull/8354
- [ ] https://github.com/neondatabase/neon/pull/8430
- [x] failpoint testing between the 3 different completion points
- [x] storcon: should just pass through the ApiError from pageserver
- [ ] --- Cutline for considering project functionally complete ---
- [ ] Persist the ancestry of timelines after detach, to enable future WAL recovery; fixme comment with full example added in https://github.com/neondatabase/neon/pull/8354#discussion_r1677681420
- [x] should L1s be written with the actual highest key? Probably not, because the highest key is only known after the rewrite completes. Resolution: they should not be.
- [ ] pageserver config: default concurrency items from options
- [ ] unify reset_tenant and complete detaching
- [ ] there seems to always be one rewritten layer, but test asserts a case where that does not happen; investigate, is the straddling check wrong or test assertion does not work?
- [ ] TODO: discuss fate of WAL recovery for these tenants

Other related tasks and Epics

koivunej commented 7 months ago

Next steps:

koivunej commented 7 months ago

Last week I did not get to start this, however, not everyone had a chance to look at the PR either. We did have one call about the RFC, summarized here: https://github.com/neondatabase/neon/pull/6888#issuecomment-1981093123

This week:

koivunej commented 7 months ago

This week:

koivunej commented 6 months ago

Starting implementation this week even if RFC discussion still is waiting for alternate proposals.

koivunej commented 6 months ago

Per discussion:

koivunej commented 6 months ago

Update for the week:

After discussing with John:

So, reworking the implementation.

koivunej commented 6 months ago

Reworking the implementation has been slow, but some:

The maintenance mode requirement can be relaxed: maintenance mode is only required for page_service connections, because we can "rewrite" the ancestor+ancestor_lsn in RemoteTimelineClient. If we accept that the API endpoint leaves the in-memory state of the Timelines out of sync, then the control plane can simply reset the tenant after the operation.
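From the control plane's side, the relaxed flow could look roughly like this. This is a sketch under assumptions: `PageserverClient`-style method names (`detach_ancestor`, `reset_tenant`) are hypothetical wrappers, not the real API surface.

```python
def detach_and_reset(ps, tenant_id: str, timeline_id: str):
    """Run detach-ancestor without tenant-wide maintenance mode, then
    reset the tenant so the stale in-memory Timeline state is reloaded.

    `ps` is a hypothetical pageserver HTTP client wrapper.
    """
    # Rewrites ancestor/ancestor_lsn in remote metadata; the in-memory
    # Timeline objects are left out of sync on purpose.
    ps.detach_ancestor(tenant_id, timeline_id)
    # Reloading the tenant brings the in-memory state back in line.
    ps.reset_tenant(tenant_id)


class _RecordingClient:
    """Test stub that records the calls made against it."""
    def __init__(self):
        self.calls = []

    def detach_ancestor(self, tenant_id, timeline_id):
        self.calls.append(("detach_ancestor", tenant_id, timeline_id))

    def reset_tenant(self, tenant_id):
        self.calls.append(("reset_tenant", tenant_id))


rec = _RecordingClient()
detach_and_reset(rec, "tenant-1", "timeline-1")
```

The point of the sketch is the ordering: the reset must come after the detach, since it is what reconciles the deliberately stale in-memory state.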

Another detach ancestor case was discovered: if the "new main" gc_cutoff has progressed beyond the ancestor_lsn, then there is no real connection between the two timelines. This also means we must not reparent any timelines, only do the metadata change. So far I've been thinking of this case as "diverged".
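The "diverged" check above can be sketched as follows. This is illustrative Python pseudologic, not the actual Rust pageserver code; LSNs are modeled as plain integers and all names are made up:

```python
def plan_detach(new_main_gc_cutoff: int, ancestor_lsn: int) -> str:
    """Decide how a detach-ancestor operation should proceed.

    If the "new main" timeline's gc_cutoff has progressed beyond the
    ancestor_lsn (the branch point), there is no real connection left
    between the two timelines: only the metadata change is done and no
    sibling timelines are reparented.
    """
    if new_main_gc_cutoff > ancestor_lsn:
        return "diverged"  # metadata change only, no reparenting
    return "normal"        # copy prefix layers, reparent eligible timelines
```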

The uncertainty the "diverged" case introduces into which timelines end up reparented means that the control plane will always need to query the ancestry relationships after invoking the operation.
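A caller-side refresh might look like this. Hedged sketch: `list_timelines` and the response fields are assumptions modeled loosely on pageserver timeline listings, not a verified API:

```python
class _StubClient:
    """Hypothetical stand-in for a pageserver HTTP client, returning a
    canned timeline listing for illustration."""
    def list_timelines(self, tenant_id):
        return [
            {"timeline_id": "main"},
            {"timeline_id": "child", "ancestor_timeline_id": "main"},
        ]


def refresh_ancestry(client, tenant_id: str) -> dict:
    """Re-read every timeline's ancestor after a detach-ancestor call,
    since the "diverged" case makes the set of reparented timelines
    unpredictable to the caller.

    Returns {timeline_id: ancestor_timeline_id or None}.
    """
    return {
        t["timeline_id"]: t.get("ancestor_timeline_id")
        for t in client.list_timelines(tenant_id)
    }


ancestry = refresh_ancestry(_StubClient(), "tenant-1")
```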

koivunej commented 5 months ago

Plan for the week:

Compaction problem:

Right now, if the ancestor receives writes while the operation is ongoing, they might trigger a compaction. If there is a shutdown, then on retry we could end up with a union of the pre- and post-compaction layers, which might be a bad thing. An easy solution is to disallow copying the LSN prefix of L0 layers, but that only works with the current legacy compaction algorithm and does not scale to the next one.
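A toy model of the retry hazard, heavily simplified and with made-up layer names: a naive retry that copies whatever layers the ancestor currently has can end up holding both the original L0s and the L1 that compaction produced from them.

```python
def detach_copy(ancestor_layers, already_copied):
    """Naive retry behavior: copy every layer the ancestor currently
    has, keeping whatever an earlier, interrupted attempt copied."""
    return set(already_copied) | set(ancestor_layers)


# First attempt copies two L0 layers, then the pageserver shuts down.
first = detach_copy({"L0_a", "L0_b"}, set())

# Meanwhile the ancestor compacts: L0_a and L0_b are replaced by L1_ab.
second = detach_copy({"L1_ab"}, first)

# The retry now holds the union of pre- and post-compaction layers,
# i.e. overlapping copies of the same key/LSN range.
```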

koivunej commented 5 months ago

Plan for the week:

koivunej commented 5 months ago

Plan for the week:

koivunej commented 5 months ago

Plan for the week:

  1. finish up the openapi descriptions, adapt the code to work like described, document delta
  2. detach two ancestors for one specific prod tenant after deploy
  3. complete persistent gc+compaction blocking
    - pageserver crash (failpoint) tests
koivunej commented 5 months ago

Last week, due to other work, only "detach two ancestors for one specific prod tenant after deploy" was done. A minor follow-up from that was completed:

Not started:

What remains:

New for this week:

koivunej commented 4 months ago

Last week got trampled by testing and troubleshooting.

This week:

koivunej commented 4 months ago

Similarly with the previous week.

This week:

koivunej commented 4 months ago

Plan remains the same for the next few weeks when I will be on vacation.

Before that, production usage of timeline ancestor detach requires retrying with the reset_tenant endpoint in case #7830 happens. Using it for sharded tenants is also not advisable, because the endpoint is not yet idempotent, but it could be done manually.

koivunej commented 3 months ago

This week the plan remains the same, but #7830 looks like an easy first issue, testing aside.

koivunej commented 2 months ago

This week: