Open jcsp opened 6 months ago
After a short discussion with John, I have two more design choices in mind:
An early draft of the proposal: https://www.notion.so/neondatabase/DRAFT-Generic-Key-Value-Storage-on-Page-Server-fbe03ae11d4a4eb7b60ef800f89f0fa3
Aux file part is:
The pageserver is good at storing pages referenced by a block number, but not so good at generic key-value data.
There are two places where we have large key-value maps without a good place to store them:
rel_size_to_key
A more scalable store will enable:
This epic assumes:
Goals:
Possible implementation
We could write a hash table that uses a page for each slot.
Scale: 1 million relation sizes at ~16 bytes per relation is about 16 MB, which would use about 2000 pages at 8 KiB per page.
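The scale estimate above can be checked with a short calculation, together with one possible way of mapping a key to a slot page. This is a minimal sketch, not the proposed implementation: the 8 KiB page size, the `pages_needed` and `slot_for_key` helpers, and the choice of BLAKE2b as the hash are all assumptions made for illustration.

```python
import hashlib

PAGE_SIZE = 8192   # assumption: 8 KiB pages
ENTRY_SIZE = 16    # bytes per relation size, from the estimate above


def pages_needed(num_entries: int,
                 entry_size: int = ENTRY_SIZE,
                 page_size: int = PAGE_SIZE) -> int:
    """Pages required if entries are packed densely, one slot per page."""
    entries_per_page = page_size // entry_size
    # Ceiling division without floating point.
    return -(-num_entries // entries_per_page)


def slot_for_key(key: bytes, slot_count: int) -> int:
    """Hypothetical key-to-slot mapping using a stable 64-bit hash."""
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % slot_count


if __name__ == "__main__":
    slots = pages_needed(1_000_000)
    print(slots)
    print(slot_for_key(b"example-relation-key", slots))
```

A stable hash matters here because Python's built-in `hash()` is randomized per process for `bytes`, so it could not be used to locate a slot again after a restart.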
Writes to pages will typically be repetitive: e.g. the logical size of a relation may be rewritten many times.
A single page per table can be used as a "superblock" that describes the number of slots that exist and the range of block numbers which contain them. This may be used to implement re-sharding of the table, as an optimization to avoid using a large page count for small databases (a very common case).
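To make the superblock idea concrete, here is a minimal sketch of the metadata it might hold and how re-sharding could produce a new layout. The `Superblock` type, its field names, and the `grow` helper are hypothetical; the issue does not specify a layout, and placing the new block range directly after the old one is just one possible choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Superblock:
    """Hypothetical per-table metadata page."""
    slot_count: int    # number of hash table slots that exist
    first_block: int   # block number of slot 0

    def block_for_slot(self, slot: int) -> int:
        if not 0 <= slot < self.slot_count:
            raise IndexError(f"slot {slot} out of range")
        return self.first_block + slot

    def block_range(self) -> range:
        return range(self.first_block, self.first_block + self.slot_count)


def grow(sb: Superblock, factor: int = 2) -> Superblock:
    """Describe a larger table placed after the old block range.

    Assumption: the old range stays readable until entries have been
    rehashed into the new slots, after which the new superblock is written.
    """
    return Superblock(slot_count=sb.slot_count * factor,
                      first_block=sb.first_block + sb.slot_count)


if __name__ == "__main__":
    small = Superblock(slot_count=4, first_block=1)
    print(list(small.block_range()))
    print(grow(small))
```

Starting with a small `slot_count` and growing on demand is what would keep the page count low for small databases.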
For storage, hash table slots will have a delta and value format. Runtime state will be used to bound the depth of deltas for each slot: the image value for the slot will be periodically written to enforce this bound. This will result in some I/O amplification for logical sizes compared with the current scheme of simply writing each size as an image every time, to a different page. We can control this I/O amplification by choosing the page count.
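The bounded delta depth described above can be sketched as follows. This models only the runtime bookkeeping, not an on-disk format: the `Slot` class, the `max_delta_depth` parameter, and the use of `None` as a delete marker are assumptions for illustration.

```python
class Slot:
    """One hash table slot: a base image plus a bounded chain of deltas."""

    def __init__(self, max_delta_depth: int):
        self.max_delta_depth = max_delta_depth
        self.image = {}          # last full image written for this slot
        self.deltas = []         # (key, value) updates since that image
        self.images_written = 0  # counts the extra image writes

    def put(self, key, value):
        """Record an update as a delta; value=None marks a delete."""
        self.deltas.append((key, value))
        # Enforce the bound: once the chain reaches the limit,
        # fold it into a fresh image.
        if len(self.deltas) >= self.max_delta_depth:
            self.image = self.materialize()
            self.deltas.clear()
            self.images_written += 1

    def materialize(self) -> dict:
        """Replay deltas over the image to get the current contents."""
        state = dict(self.image)
        for key, value in self.deltas:
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
        return state

    def get(self, key):
        return self.materialize().get(key)


if __name__ == "__main__":
    slot = Slot(max_delta_depth=3)
    for size in range(10):
        slot.put("rel", size)
    print(slot.get("rel"), slot.images_written, len(slot.deltas))
```

In this sketch a read replays at most `max_delta_depth - 1` deltas, and every `max_delta_depth`-th write to a slot pays for a full image. With more pages there are fewer keys per slot, so each of those images is smaller, which is the page count trade-off mentioned above.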
Where a particular KV collection requires larger values than typical (e.g. pg_stat can be multiple megabytes), we should use separate KV collections for the "big" values to avoid incurring heavy write amplification from mixing them with more frequently updated smaller values.
So within each timeline, we would have three instances of this new hash table: