neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Epic: Pageserver internal catalog #4636

Open LizardWizzard opened 1 year ago

LizardWizzard commented 1 year ago

Motivation

Currently, the pageserver has to maintain a certain amount of metadata. This metadata is used to properly load tenants and timelines and to deal with non-atomic actions (such as bootstrapping a timeline from scratch using initdb).

The problem has two parts. If we look at a pageserver with 100k tenants on it, the loading process has to open a ridiculous number of files. The pageserver is supposed to operate with a high number of tenants attached to it, but with most of them not being active (i.e. without a running compute).

It has to list the /tenants directory and load each tenant's config, which is a separate file. Then for each tenant we list its timelines directory to learn which timelines exist. Then for each timeline we load its metadata file and list the directory to learn which layer files are there.
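As a rough illustration of why this adds up, here is a sketch of that load path; the file and directory names are assumptions for illustration, not the real pageserver layout:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical sketch of the load path described above. It only shows how
// the number of filesystem operations multiplies:
// tenants x (1 config + timelines x (1 metadata + 1 layer listing)).
fn count_startup_fs_ops(tenants_dir: &Path) -> io::Result<u64> {
    let mut fs_ops = 0;
    for tenant in fs::read_dir(tenants_dir)? {
        let tenant_path = tenant?.path();
        // one read per tenant for its config file
        let _config = fs::read(tenant_path.join("config"))?;
        fs_ops += 1;
        for timeline in fs::read_dir(tenant_path.join("timelines"))? {
            let timeline_path = timeline?.path();
            // one read per timeline for its metadata file
            let _metadata = fs::read(timeline_path.join("metadata"))?;
            fs_ops += 1;
            // plus one directory listing per timeline to discover layer files
            let _layers: Vec<_> = fs::read_dir(&timeline_path)?.collect();
            fs_ops += 1;
        }
    }
    Ok(fs_ops)
}
```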

The second part of the problem is the so-called mark files. When we need to perform a non-atomic action, we use a specially named file to indicate that the operation has started; if the file is still present later, we know the operation must have been interrupted. So if the operation was interrupted by a crash restart, we can clean up the traces of the unfinished operation or resume it, whichever is more appropriate.

Examples of mark files include TimelineUninitMark, which is used during timeline creation; the tenant ignore mark file, which is used to temporarily exclude a tenant from the working set; and the tenant attaching mark, which is used to continue interrupted attach operations. There are also temporary tenant directories which are used during tenant initialization.
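As a rough illustration of the pattern (the names and layout here are hypothetical, not the actual pageserver code): a mark file is created before a non-atomic operation starts and removed only once it finishes, so its presence at the next startup means the operation was interrupted.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

fn uninit_mark_path(timeline_dir: &Path) -> PathBuf {
    timeline_dir.with_extension("uninit-mark")
}

// Create the mark first, do the non-atomic work, remove the mark last.
fn create_timeline(timeline_dir: &Path) -> io::Result<()> {
    let mark = uninit_mark_path(timeline_dir);
    fs::write(&mark, b"")?; // "operation started"
    fs::create_dir_all(timeline_dir)?;
    // ... run initdb, write the metadata file, etc. ...
    fs::remove_file(&mark) // "operation finished"
}

// On startup: a surviving mark means creation was interrupted,
// so the half-created timeline directory can be cleaned up.
fn cleanup_after_crash(timeline_dir: &Path) -> io::Result<()> {
    let mark = uninit_mark_path(timeline_dir);
    if mark.exists() {
        if timeline_dir.exists() {
            fs::remove_dir_all(timeline_dir)?;
        }
        fs::remove_file(&mark)?;
    }
    Ok(())
}
```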

Working with mark files is non-trivial and cumbersome. Tenant and timeline deletion is another example of that.

DoD

The problems above are solved: there are no mark files, and metadata is stored in a way that allows it to be loaded quickly.

Implementation ideas

Note that there also needs to be a migration strategy. Whichever option we pick will need a gradual adoption plan.

The idea might be to first concentrate the needed APIs in some struct Catalog, then modify the implementation of the Catalog struct so it writes metadata to two places, and after that switch it completely so it uses only the new solution.
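As a sketch of what that gradual adoption could look like (the trait and method names below are assumptions, not existing pageserver APIs): step 1 routes all metadata access through a single Catalog interface, step 2 dual-writes to the legacy files and the new backend, step 3 drops the legacy side.

```rust
use anyhow::Result;

// Hypothetical interface all metadata reads/writes get funneled through.
trait Catalog {
    fn put_timeline_metadata(&self, tenant: &str, timeline: &str, meta: &[u8]) -> Result<()>;
    fn get_timeline_metadata(&self, tenant: &str, timeline: &str) -> Result<Vec<u8>>;
}

// Transitional implementation: write to both backends, read from the old one.
struct DualWriteCatalog<Old, New> {
    old: Old,
    new: New,
}

impl<Old: Catalog, New: Catalog> Catalog for DualWriteCatalog<Old, New> {
    fn put_timeline_metadata(&self, tenant: &str, timeline: &str, meta: &[u8]) -> Result<()> {
        self.old.put_timeline_metadata(tenant, timeline, meta)?;
        self.new.put_timeline_metadata(tenant, timeline, meta)
    }

    fn get_timeline_metadata(&self, tenant: &str, timeline: &str) -> Result<Vec<u8>> {
        // keep serving reads from the legacy layout until the switch-over
        self.old.get_timeline_metadata(tenant, timeline)
    }
}
```

Once the new backend has run in dual-write mode long enough to be trusted, the legacy side can be dropped and only the new implementation kept.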

### Tasks

Other related tasks and Epics

knizhnik commented 1 year ago

> This has the downside that currently this kv store is tied to postgres semantics. GC and compaction don't work for anything except postgres pages.

KV storage consists of two parts: the first one performs the mapping of Postgres keys (pgdatadir_mapping), and the other is pure KV storage where keys and values are opaque. Also, GC and compaction have nothing to do with Postgres pages - they operate on layers, not on pages.

The main drawback of using KV storage for this data is that the data is more or less temporary, while KV storage is persistent, including eviction to S3. But maybe it is not such a big problem...

> Use SQLite or some other embedded solution.

There may be problems with concurrent access if hundreds of active tenants try to access the same embedded engine. SQLite has a very primitive concurrency control mechanism. But once again, it may not be a problem if operations requiring access to this engine are quite rare.

LizardWizzard commented 1 year ago

> and the other is pure KV storage where keys and values are opaque.

They are sort of opaque. The storage operates with LSNs, and updates keep history that needs compaction, which has a different meaning from our current compaction. Our compaction expects Postgres-specific keys to be present in the keyspace; I think GC does too.

We can make it work, but this will require changes to make our kv storage truly opaque. I wouldn't underestimate the amount of required changes. Imagine how many edge cases it will create because of two significantly different modes (catalog vs. usual tenant).

> SQLite has a very primitive concurrency control mechanism

Yes, I don't think this would be a problem. I would implement this as a Catalog actor which processes messages one by one. This should be pretty simple to reason about, and the load shouldn't be that high. This is all metadata, after all.
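A minimal sketch of that actor shape, assuming tokio channels (names are illustrative, and the in-memory map stands in for whatever embedded store ends up behind it): all requests go through one task, so the store never sees concurrent access and every operation is applied in order.

```rust
use std::collections::HashMap;
use tokio::sync::{mpsc, oneshot};

enum CatalogRequest {
    Get { key: String, reply: oneshot::Sender<Option<Vec<u8>>> },
    Put { key: String, value: Vec<u8>, reply: oneshot::Sender<()> },
}

// The actor owns the store and processes requests strictly one by one.
async fn catalog_actor(mut rx: mpsc::Receiver<CatalogRequest>) {
    let mut store: HashMap<String, Vec<u8>> = HashMap::new();
    while let Some(req) = rx.recv().await {
        match req {
            CatalogRequest::Get { key, reply } => {
                let _ = reply.send(store.get(&key).cloned());
            }
            CatalogRequest::Put { key, value, reply } => {
                store.insert(key, value);
                let _ = reply.send(());
            }
        }
    }
}

// Callers hold a cheap, cloneable handle and await replies over oneshot channels.
#[derive(Clone)]
struct CatalogHandle {
    tx: mpsc::Sender<CatalogRequest>,
}

impl CatalogHandle {
    async fn put(&self, key: String, value: Vec<u8>) {
        let (reply, done) = oneshot::channel();
        let _ = self.tx.send(CatalogRequest::Put { key, value, reply }).await;
        let _ = done.await;
    }
}
```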