Next steps:
Last week I did not get to start this; however, not everyone had a chance to look at the PR either. We did have one call about the RFC, summarized here: https://github.com/neondatabase/neon/pull/6888#issuecomment-1981093123
This week:
Starting the implementation this week even if the RFC discussion is still waiting for alternate proposals.
Per discussion:
index_part.json, which will limit the number of possibly copied but not yet visible layers to, say, a magical number of 20.
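As a rough illustration of that bound (the struct, names, and constant below are hypothetical, not the actual index_part.json schema), the in-progress state could simply refuse to copy further layers once the cap is reached:

```rust
// Hypothetical sketch only: tracks layers copied to the detached timeline's
// remote prefix but not yet referenced by an uploaded index_part.json.

/// The magical upper bound discussed above; the value and field are assumptions.
const MAX_COPIED_NOT_YET_VISIBLE: usize = 20;

#[derive(Debug, Default)]
struct CopiedButNotVisible {
    /// Layer file names copied but not yet made visible via index_part.json.
    layer_names: Vec<String>,
}

impl CopiedButNotVisible {
    /// Refuse to start copying another layer once the cap is reached; the caller
    /// would first have to make earlier copies visible (upload a new index).
    fn try_record_copy(&mut self, layer_name: String) -> Result<(), String> {
        if self.layer_names.len() >= MAX_COPIED_NOT_YET_VISIBLE {
            return Err(format!(
                "too many copied but not yet visible layers ({})",
                self.layer_names.len()
            ));
        }
        self.layer_names.push(layer_name);
        Ok(())
    }
}
```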
Update for the week:
After discussing with John:
So, reworking the implementation.
Reworking the implementation has been slow, but there is some progress:
The maintenance mode requirement can be relaxed: maintenance mode is only required for page_service connections, because we can "rewrite" the ancestor+ancestor_lsn in RemoteTimelineClient. If we accept that the API endpoint leaves the in-memory state of the Timelines out of sync, then the control plane can just reset the tenant after the operation.
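A minimal sketch of that idea, with made-up types rather than the real pageserver ones: the detach rewrites the persisted ancestor fields (what RemoteTimelineClient uploads), the in-memory Timeline stays stale, and the tenant reset rebuilds it from the rewritten metadata. Whether ancestor_lsn is zeroed or kept is an assumption here.

```rust
// Illustrative types only; not the actual neon metadata definitions.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Lsn(u64);

#[derive(Clone, Copy, Debug, PartialEq)]
struct TimelineId(u128);

#[derive(Debug)]
struct TimelineMetadata {
    ancestor_timeline: Option<TimelineId>,
    ancestor_lsn: Lsn,
}

/// Rewrite the persisted metadata so the timeline no longer has an ancestor.
/// After this is uploaded, the in-memory Timeline still believes it has one;
/// a tenant reset reconciles the two.
fn detach_in_metadata(metadata: &mut TimelineMetadata) {
    metadata.ancestor_timeline = None;
    metadata.ancestor_lsn = Lsn(0); // assumption: reset, rather than retained
}
```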
Another detached ancestor case was discovered: if the "new main" gc_cutoff has progressed beyond the ancestor_lsn, then there is no real connection between the two timelines. This also means we must not reparent any timelines but only do the metadata change. So far, I've been thinking of this case as "diverged".
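A hedged sketch of the check implied above, using an illustrative Lsn type; where gc_cutoff is read from and the exact comparison are assumptions:

```rust
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
struct Lsn(u64);

/// "Diverged" case: the detaching ("new main") timeline's gc_cutoff has moved
/// past its own ancestor_lsn, so the history at the branch point is gone and
/// no other timelines should be reparented onto it; only the metadata changes.
fn is_diverged(new_main_gc_cutoff: Lsn, new_main_ancestor_lsn: Lsn) -> bool {
    new_main_gc_cutoff > new_main_ancestor_lsn
}
```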
The uncertainty the "diverged" case brings to the selection of timelines that get reparented means that the control plane will always need to query the ancestry relationships after invoking the operation.
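To illustrate the consequence for the caller (the trait and method names below are hypothetical, not the actual control plane or pageserver API), the control plane would re-read the ancestry rather than predict which timelines were reparented:

```rust
use std::collections::HashMap;

type TimelineId = String; // stand-in for the real id type

/// Hypothetical client abstraction; not the real pageserver HTTP API.
trait PageserverClient {
    fn detach_ancestor(&self, timeline: &TimelineId);
    fn list_timelines_with_ancestors(&self) -> HashMap<TimelineId, Option<TimelineId>>;
}

/// Invoke the detach, then rebuild the branch-tree view from what the
/// pageserver now reports, since the reparented set is not known up front.
fn detach_and_refresh(
    client: &dyn PageserverClient,
    timeline: &TimelineId,
) -> HashMap<TimelineId, Option<TimelineId>> {
    client.detach_ancestor(timeline);
    client.list_timelines_with_ancestors()
}
```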
Plan for the week:
Compaction problem:
Right now, if the ancestor receives writes while the operation is ongoing, they might trigger a compaction. If there is a shutdown, then on retry we would end up with a union of the layers, which might be a bad thing. An easy solution is to not allow copying an LSN prefix of L0 layers, but that does not scale to the next compaction algorithm, only to the current legacy one.
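A sketch of that easy solution, with illustrative types only: layers wholly below the branch point are copied, layers wholly above are skipped, and an L0 layer straddling the branch point is rejected, so a retry cannot select a different set of layers than the first attempt.

```rust
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
struct Lsn(u64);

#[derive(Debug)]
struct LayerDesc {
    lsn_start: Lsn,
    lsn_end: Lsn, // exclusive end of the layer's LSN range
    is_l0: bool,
}

/// Decide whether a layer may be copied to the detached timeline.
fn may_copy(layer: &LayerDesc, ancestor_lsn: Lsn) -> Result<bool, &'static str> {
    if layer.lsn_end <= ancestor_lsn {
        // wholly below the branch point: safe to copy verbatim
        Ok(true)
    } else if layer.lsn_start >= ancestor_lsn {
        // wholly above the branch point: irrelevant for the detach
        Ok(false)
    } else if layer.is_l0 {
        // copying an LSN prefix of an L0 layer is disallowed, so retries
        // cannot produce a different (union) set of layers
        Err("L0 layer straddles the branch point")
    } else {
        Ok(true)
    }
}
```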
Plan for the week:
Last week, only "detach two ancestors for one specific prod tenant after deploy" got done because of other work. A minor follow-up from that was completed:
Not started:
What remains:
New for this week:
Last week got trampled by testing and troubleshooting.
This week:
Similar to the previous week.
This week:
The plan remains the same for the next few weeks, when I will be on vacation.
Before that, production usage of timeline ancestor detach requires retrying with the reset_tenant endpoint in case #7830 happens. Using it for sharded tenants is also not advisable, because the endpoint is not yet idempotent, but it could be done manually.
This week the plan remains the same, but #7830 looks like an easy first issue, testing aside.
This week:
Implement the timeline_detach API endpoint.

Motivation

#6888

DoD

Implementation ideas

Other related tasks and Epics