neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

pageserver: generalized S3 DR including shard ancestry #9411

Open jcsp opened 2 days ago

jcsp commented 2 days ago

Currently, the S3 DR procedure (https://www.notion.so/neondatabase/Storage-Recovery-from-S3-history-93c4b3f70265468ca49e27a19afd40b9) assumes you know which shards you need. This may not be the case if the tenant was split recently.

We should generalize this so that it detects whether the child shards are empty at the requested point in time and, if they are, looks up the data in their parent shards.
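A minimal sketch of that fallback, in Python and under loudly stated assumptions: the key layout "tenants/<tenant_shard_id>/<file>", the "-NNCC" shard suffix, and treating an unsharded tenant as shard count 1 are all illustrative, as are the helper names shard_count_of and pick_shard_count. It is not the pageserver's code. Given (key, last_modified) pairs for one tenant, it ignores versions written after the target time, so child shards created by a later split look empty, and it returns the largest shard count that already had data.

```python
from typing import Iterable, Optional, Tuple

# ASSUMPTION (illustrative only): keys look like "tenants/<tenant_shard_id>/<file>",
# an unsharded tenant uses the bare tenant id, and a shard appends
# "-NNCC" (NN = shard number, CC = shard count, two hex digits each).

def shard_count_of(tenant_shard_id: str) -> int:
    """Parse the shard count from a tenant shard id; unsharded counts as 1 here."""
    _, sep, suffix = tenant_shard_id.rpartition("-")
    if not sep or len(suffix) != 4:
        return 1  # no shard suffix: treat as an unsharded tenant
    return int(suffix[2:], 16)

def pick_shard_count(
    versions: Iterable[Tuple[str, float]], target_time: float
) -> Optional[int]:
    """Pick the shard count to restore at target_time (hypothetical helper).

    versions holds (key, last_modified) pairs for one tenant. Versions written
    after target_time are skipped, so children from a later split look empty,
    and the result falls back to the largest shard count that already had data.
    """
    counts_with_data = set()
    for key, last_modified in versions:
        if last_modified > target_time:
            continue  # written after the requested point in time
        tenant_shard_id = key.split("/")[1]
        counts_with_data.add(shard_count_of(tenant_shard_id))
    return max(counts_with_data) if counts_with_data else None
```

For example, if the unsharded tenant wrote at t=100 and two child shards first wrote at t=200, a restore to t=150 picks shard count 1 (the parent), while a restore to t=250 picks 2. The sketch deliberately ignores partially populated shard sets.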

arpad-m commented 1 day ago

Looking back at my past PR #6821, it seems that I had already considered the shard splitting case: the time_travel_remote_storage endpoint requires the list of shard counts to be passed in manually.

It must be the list of all shard counts that have been in use at any point between the target time and the current time. A Grafana query could probably be built to obtain these shard counts. If we want something more "proper", we could build a table in the storage controller's database that logs shard splitting operations. Personally, I'd like something like an operation log for the storage controller in general, where its decisions are recorded at a higher level than info-level log lines.
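If such a split log existed, deriving the list of shard counts to pass would be simple. The sketch below is hypothetical throughout: the SplitRecord row shape, the shard_counts_since helper, and the use of 1 for an unsharded tenant are assumptions for illustration, not an existing storage controller schema. The count in effect at the target time is the old_count of the first split after it (or the current count if there was none), followed by every new_count produced since.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class SplitRecord:
    """One row of a HYPOTHETICAL shard-split log in the storage controller DB."""
    at: float       # time the split completed
    old_count: int  # shard count before the split (1 = unsharded, by assumption)
    new_count: int  # shard count after the split

def shard_counts_since(
    splits: List[SplitRecord], current_count: int, target_time: float
) -> List[int]:
    """Shard counts in use between target_time and now, oldest first."""
    later = sorted((s for s in splits if s.at > target_time), key=lambda s: s.at)
    if not later:
        # No split since target_time: only the current shard count was in use.
        return [current_count]
    counts = [later[0].old_count]  # the count in effect at target_time
    counts.extend(s.new_count for s in later)
    return counts
```

With splits 1 -> 2 at t=100 and 2 -> 8 at t=300, a target time of 50 yields [1, 2, 8], 200 yields [2, 8], and 400 yields [8].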

One can obtain the relevant shard counts via ListObjectVersions, but that query does not scale. If we decided that it scales well enough, we could already do S3 DR using the unsharded tenant ID as the prefix. That should work because prefixes in S3 are just that: plain string prefixes. "a" is a prefix of "ab/c" just as much as "ab" is.
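To illustrate the prefix argument, here is a small Python sketch that runs on a list of keys rather than a real bucket. The "tenants/<tenant_shard_id>/<file>" layout and the shard_ids_under helper are assumptions for illustration. The first check mimics what a single listing with Prefix set to "tenants/<tenant_id>" would return, which includes the unsharded tenant and every shard. The second check drops any other tenant whose ID merely starts with the same characters, a safeguard that matters only if tenant IDs were not fixed-length.

```python
from typing import Iterable, List

# ASSUMPTION: keys look like "tenants/<tenant_shard_id>/<file>", where a shard
# id is either the bare tenant id or "<tenant_id>-<suffix>".
TENANTS_PREFIX = "tenants/"

def shard_ids_under(keys: Iterable[str], tenant_id: str) -> List[str]:
    """Return the distinct tenant shard ids found under the tenant's prefix."""
    prefix = TENANTS_PREFIX + tenant_id
    found = set()
    for key in keys:
        # S3 prefix semantics: plain string matching, no path awareness.
        if not key.startswith(prefix):
            continue
        shard_id = key[len(TENANTS_PREFIX):].split("/", 1)[0]
        # Keep the unsharded id and its shards; drop e.g. "abcd" for "abc".
        if shard_id == tenant_id or shard_id.startswith(tenant_id + "-"):
            found.add(shard_id)
    return sorted(found)
```

With keys for "abc", "abc-0002", "abc-0102", "abcd" and "abd", shard_ids_under(keys, "abc") returns ["abc", "abc-0002", "abc-0102"]. In practice the keys would come from paginated ListObjectVersions calls over that prefix, which is exactly the part that does not scale for tenants with long histories.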