neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Epic: scalable metadata key-value store in pageserver #7290

Open jcsp opened 6 months ago

jcsp commented 6 months ago

The pageserver is good at storing pages referenced by a block number, but not so good at generic key-value data.

There are two places we have large key-value maps without a good place to store them:

A more scalable store will enable:

This epic assumes:

Goals:

Possible implementation

We could write a hash table that uses a page for each slot.
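As a rough illustration of the idea (not the pageserver's actual design), the sketch below treats each slot as one page that holds many key-value entries. The class name, the use of CRC32 as the hash, and the in-memory dicts standing in for pages are all assumptions made for the example.

```python
import zlib


class PagedHashTable:
    """Toy hash table in which every slot is one page holding many entries."""

    def __init__(self, slot_count: int):
        # Each dict stands in for the contents of one page (hypothetical layout).
        self.pages = [dict() for _ in range(slot_count)]

    def _slot(self, key: bytes) -> int:
        # crc32 is deterministic across processes, unlike Python's built-in hash().
        return zlib.crc32(key) % len(self.pages)

    def put(self, key: bytes, value: bytes) -> None:
        # A put only touches the single page that the key hashes to.
        self.pages[self._slot(key)][key] = value

    def get(self, key: bytes):
        # Returns None when the key is absent.
        return self.pages[self._slot(key)].get(key)
```

A lookup or update touches exactly one page, which is what lets the table reuse a store that is already good at addressing pages by block number.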

Scale: 1 million relation sizes at ~16 bytes per relation would use about 2000 pages.
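The page count can be checked with a quick calculation. This assumes the standard 8 KiB Postgres page size and ignores any per-page header or per-entry overhead, neither of which is stated in the issue.

```python
PAGE_SIZE = 8192  # assumption: default 8 KiB Postgres page


def pages_needed(entries: int, bytes_per_entry: int, page_size: int = PAGE_SIZE) -> int:
    """Pages required to hold the entries, rounding up and ignoring overhead."""
    total_bytes = entries * bytes_per_entry
    return -(-total_bytes // page_size)  # ceiling division on integers


print(pages_needed(1_000_000, 16))  # 1954, i.e. "about 2000 pages"
```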

Writes to pages will typically be repetitive: e.g. the logical size of a relation may be re-written very many times.

A single page per table can be used as a "superblock" that describes the number of slots that exist and the range of block numbers which contain them. This may be used to implement re-sharding of the table, as an optimization to avoid using a large page count for small databases (a very common case).
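One way to picture the superblock is sketched below, under the assumption that slots occupy a contiguous range of block numbers starting at `first_block`. The `Superblock` type, the `reshard` helper, and the CRC32 hash are hypothetical names chosen for the example, not anything defined in the issue.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Superblock:
    slot_count: int   # number of slots that exist
    first_block: int  # block number of slot 0; slots are assumed contiguous

    def block_for_key(self, key: bytes) -> int:
        # Map a key to the block number of the slot page that should hold it.
        return self.first_block + zlib.crc32(key) % self.slot_count


def reshard(old_pages: dict, new_sb: Superblock) -> dict:
    """Redistribute every entry into the slot layout described by new_sb."""
    new_pages = {new_sb.first_block + i: {} for i in range(new_sb.slot_count)}
    for page in old_pages.values():
        for key, value in page.items():
            new_pages[new_sb.block_for_key(key)][key] = value
    return new_pages
```

A small database could start with a single slot page and only be re-sharded to a larger slot count once it grows, so the common case never pays for thousands of mostly empty pages.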

For storage, hash table slots will have a delta and value format. Runtime state will be used to bound the depth of deltas for each slot: the image value for the slot will be periodically written to enforce this bound. This will result in some I/O amplification for logical sizes compared with the current scheme of simply writing each size as an image every time, to a different page. We can control this I/O amplification by choosing the page count.
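The delta-plus-image scheme with a bounded delta depth might look roughly like this. It is a minimal in-memory sketch: the `DeltaSlot` class, the `max_delta_depth` parameter, and the policy of writing an image as soon as the bound is reached are assumptions, not details given in the issue.

```python
class DeltaSlot:
    """One hash table slot stored as a base image plus a bounded list of deltas."""

    def __init__(self, max_delta_depth: int):
        self.max_delta_depth = max_delta_depth
        self.image = {}          # last full image of the slot's contents
        self.deltas = []         # (key, value) updates written since that image
        self.images_written = 0  # counts full-image writes (the I/O amplification)

    def materialize(self) -> dict:
        # Reconstruct current contents by replaying deltas on top of the image.
        state = dict(self.image)
        for key, value in self.deltas:
            state[key] = value
        return state

    def put(self, key: bytes, value: bytes) -> None:
        self.deltas.append((key, value))
        # Enforce the bound: once enough deltas pile up, write a fresh image.
        if len(self.deltas) >= self.max_delta_depth:
            self.image = self.materialize()
            self.deltas.clear()
            self.images_written += 1

    def get(self, key: bytes):
        return self.materialize().get(key)
```

Each image write rewrites every entry in the slot, so spreading the same keys over more pages makes each image smaller. That is the sense in which the page count controls the amplification.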

Where a particular KV collection requires larger values than typical (e.g. pg_stat can be multiple megabytes), we should use separate KV collections for the "big" values to avoid incurring heavy write amplification from mixing them with more frequently updated smaller values.

So within each timeline, we would have three instances of this new hash table:

skyzh commented 6 months ago

After a short discussion with John, I have two more design choices in mind:

skyzh commented 6 months ago

An early draft of the proposal: https://www.notion.so/neondatabase/DRAFT-Generic-Key-Value-Storage-on-Page-Server-fbe03ae11d4a4eb7b60ef800f89f0fa3

jcsp commented 5 months ago

Aux file part is: