Usage and product changes

We implement the architecture and most of the machinery required for tracking data statistics. The statistics will primarily be used for query planning. We achieved several design goals:
1) Not scanning the entire storage to update statistics
2) Allowing access to statistics for old versions, which supports even very old time travel/MVCC usage without much performance degradation
3) Not writing statistics to the storage layer in RocksDB, since updating the statistics keys on every transaction would degrade performance, and statistics can/should be a primarily in-memory structure.
However, we accept the trade-off that the statistics are not always up to date. The update frequency is a parameter we can optimise.
In the end, there is a single database-wide statistics struct, which is immutable and updated periodically. We update it by scanning the data WAL records, summing the count deltas since the last update, and then atomically replacing the Statistics struct held by the database. The statistics are checkpointed into the WAL, which also allows us to time-travel to older snapshots and find a relatively accurate statistics entry from near that version.
This means we are solidifying the requirement that WAL cleaning and MVCC compaction are tied to the same time-scale - both are required for correctly going back to previous data versions.
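As a minimal sketch of this mechanism (the types and delta format here are hypothetical stand-ins, not the real ones): the periodic update reads count deltas from WAL records newer than the current statistics version, builds a fresh immutable struct, and swaps it in atomically.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Simplified, hypothetical stand-ins for the real types: an immutable,
// database-wide statistics snapshot, plus per-commit count deltas read
// from data WAL records as (sequence number, deltas) pairs.
#[derive(Debug)]
struct Statistics {
    wal_sequence_number: u64, // the WAL position these counts reflect
    instance_counts: HashMap<String, i64>,
}

struct Database {
    // Readers clone the Arc cheaply; the updater replaces it atomically.
    statistics: RwLock<Arc<Statistics>>,
}

impl Database {
    // Periodic update: sum the count deltas from WAL records committed
    // since the last statistics version, then swap in a fresh struct.
    fn refresh_statistics(&self, wal_records: &[(u64, HashMap<String, i64>)]) {
        let current = Arc::clone(&self.statistics.read().unwrap());
        let mut counts = current.instance_counts.clone();
        let mut latest = current.wal_sequence_number;
        for (seq, deltas) in wal_records {
            if *seq <= current.wal_sequence_number {
                continue; // already reflected in the current statistics
            }
            for (type_name, delta) in deltas {
                *counts.entry(type_name.clone()).or_insert(0) += delta;
            }
            latest = latest.max(*seq);
        }
        let updated = Arc::new(Statistics { wal_sequence_number: latest, instance_counts: counts });
        *self.statistics.write().unwrap() = updated; // atomic replacement
    }
}
```

Because the struct is never mutated in place, in-flight readers keep using the snapshot they already hold while new readers see the updated one.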
Future work
We could find that reading from the WAL to update statistics is a bottleneck. We can solve several problems at once by extracting the commit data "cache" from the IsolationManager into the DurabilityClient, which can then be shared across isolation and statistics operations.
Implementation
Architecture
Promote WAL and Checkpoint management into Database
We move Checkpoint and the associated commit_replay methods into a new module: //storage/recovery
We also add an (ultimately unused) system to 'extend' a checkpoint with additional data
Split database creation and loading into two separate entry points, and rearrange the corresponding methods all the way down into Storage, WAL, and Checkpointing. We also update corresponding tests.
Given a database directory, each module creates its own subdirectory:
MVCCStorage creates db-name/storage
WAL creates db-name/wal
Checkpoint creates db-name/checkpoint
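The per-module layout above could be sketched as follows (the helper names are hypothetical; only the subdirectory names come from the list above):

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Hypothetical helper mirroring the layout described above: given the
// database directory, each module owns exactly one subdirectory.
fn module_dir(database_dir: &Path, module: &str) -> std::io::Result<PathBuf> {
    let dir = database_dir.join(module);
    fs::create_dir_all(&dir)?; // created on database creation, reused on load
    Ok(dir)
}

fn create_database_layout(database_dir: &Path) -> std::io::Result<()> {
    // Each module creates its own subdirectory under db-name/.
    for module in ["storage", "wal", "checkpoint"] {
        module_dir(database_dir, module)?;
    }
    Ok(())
}
```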
Introduce //concept/thing/Statistics, which stores thing statistics for instances of each type, role playing, and relation indexing, etc.
We checkpoint statistics into the WAL using a new record type, and load the last one on bootup. This also helps solve the MVCC time-travel problem, where going back in time could lead to mismatched statistics being used (or even not relevant for the different schema!). We probably allow using the existing Statistics if the "old" sequence number being opened is no more than N (~100) versions behind the statistics version.
This means we are solidifying the requirement that WAL cleaning and MVCC compaction are tied to the same time-scale - both are required for correctly going back to previous data versions.
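A rough sketch of the statistics shape and the version-lag reuse rule described above (field names and the struct shape are illustrative, not the real definitions):

```rust
use std::collections::HashMap;

// Hypothetical shape of the thing statistics: counts per type, plus
// role-playing counts; the real struct also covers relation indexing, etc.
struct ThingStatistics {
    sequence_number: u64, // the WAL version this snapshot reflects
    instance_counts: HashMap<String, u64>,               // per entity/relation type
    role_player_counts: HashMap<(String, String), u64>,  // (relation type, role)
}

// The reuse rule from the text: a snapshot opened at an older version may
// still use the current statistics if it is no more than N (~100) versions
// away from the statistics version.
const MAX_VERSION_LAG: u64 = 100;

fn statistics_usable_for(statistics: &ThingStatistics, open_sequence_number: u64) -> bool {
    statistics.sequence_number.abs_diff(open_sequence_number) <= MAX_VERSION_LAG
}
```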
Statistics catch-up/synchronisation is implemented by reading data commit records from the WAL.
For this, we re-create write snapshots from commit data read from disk. However, this constructor intentionally returns a narrower API, which means we cannot write or commit to a re-created write snapshot.
We add CommitType to the CommitRecord generated by CommittableSnapshots. This allows deserialising and recreating the correct type of snapshot (data or schema) from a WAL entry.
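A hypothetical sketch of both mechanisms: commit records carry a CommitType tag so WAL replay can tell data and schema commits apart, and a snapshot re-created from a commit record exposes a deliberately narrow, read-only surface. All names here are illustrative, not the real API.

```rust
// Tag serialised into each commit record so replay can recreate the
// correct kind of snapshot (data or schema).
#[derive(Clone, Copy, Debug, PartialEq)]
enum CommitType {
    Data,
    Schema,
}

impl CommitType {
    fn to_byte(self) -> u8 {
        match self {
            CommitType::Data => 0,
            CommitType::Schema => 1,
        }
    }

    fn from_byte(byte: u8) -> Option<CommitType> {
        match byte {
            0 => Some(CommitType::Data),
            1 => Some(CommitType::Schema),
            _ => None,
        }
    }
}

struct CommitRecord {
    commit_type: CommitType,
    operations: Vec<u8>, // serialised snapshot operations
}

// Re-created from a WAL entry: it can be inspected (e.g. to sum statistics
// deltas), but by construction it has no write() or commit() methods.
struct RecreatedSnapshot {
    commit_type: CommitType,
    operations: Vec<u8>,
}

impl RecreatedSnapshot {
    fn from_commit_record(record: CommitRecord) -> RecreatedSnapshot {
        RecreatedSnapshot { commit_type: record.commit_type, operations: record.operations }
    }

    fn operations(&self) -> &[u8] {
        &self.operations
    }
}
```

Encoding the read-only restriction in the type itself means misuse is a compile error rather than a runtime check.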
Refactor out the //storage/durability module into //durability package, which contains a simplified Service trait
We then create the DurabilityClient trait, which is now used throughout the code base wherever DurabilityService was used before
The intent is to allow extracting Durability into remote machine(s) using a Calvin-style partitioned WAL, if we wanted to. We will use the client to communicate with a set of durability servers, and manage collecting ordering information, etc.
For now, we only have a WALClient which wraps a WAL but conforms to the DurabilityClient trait
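The trait split could look roughly like this (a minimal sketch with simplified record and sequence-number types; the real trait surface is larger):

```rust
// Hypothetical minimal DurabilityClient: the rest of the code base programs
// against this trait, so durability could later move to remote servers
// (e.g. a Calvin-style partitioned WAL) without touching call sites.
trait DurabilityClient {
    fn append(&mut self, record: Vec<u8>) -> u64; // returns the sequence number
    fn read_from(&self, sequence_number: u64) -> Vec<&[u8]>;
}

// Local WAL stand-in: an in-memory list of records for illustration.
struct Wal {
    records: Vec<Vec<u8>>,
}

// The only implementation for now: wraps a local WAL but conforms to the
// client trait.
struct WalClient {
    wal: Wal,
}

impl DurabilityClient for WalClient {
    fn append(&mut self, record: Vec<u8>) -> u64 {
        self.wal.records.push(record);
        self.wal.records.len() as u64 // 1-based sequence numbers
    }

    fn read_from(&self, sequence_number: u64) -> Vec<&[u8]> {
        self.wal.records[(sequence_number as usize - 1)..]
            .iter()
            .map(|record| record.as_slice())
            .collect()
    }
}
```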
UX
We create a more consistent/comprehensive error structure for what happens if any of storage/wal/checkpoint are not present on bootup.
The presence or absence of the storage directory is irrelevant to bootup/recovery (the same path is taken either way). Its presence simply optimises the recovery process, since we have to copy fewer files from the checkpoints.
If the WAL is present on bootup, but no checkpoint is provided, we replay the WAL from scratch
If the WAL is present on bootup, and a checkpoint is provided, we replace the storage with the checkpoint and replay the WAL since the checkpoint
If the db-name database directory is present, but no WAL is present, this is an error state.
If the WAL directory is present, but for any reason data that is required from the WAL is not present (for example, deleted or cleaned up), this is an error. This could happen when replaying the WAL from the start with no checkpoint, or when a checkpoint is provided but the required replay point is no longer available in the WAL.
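The bootup cases above can be encoded compactly; this is a hypothetical condensation of the rules, not the real error types. The WAL is the source of truth, the checkpoint only shortens replay, and the storage directory does not affect which path is taken:

```rust
#[derive(Debug, PartialEq)]
enum Recovery {
    ReplayWalFromStart,          // WAL present, no checkpoint
    RestoreCheckpointThenReplay, // replace storage with checkpoint, replay since it
}

#[derive(Debug, PartialEq)]
enum BootError {
    MissingWal,         // database directory exists but the WAL does not
    WalDataUnavailable, // required WAL records were deleted/cleaned up
}

// Note: BootError::WalDataUnavailable is only discovered later, during
// replay itself, when a required record turns out to be missing.
fn plan_recovery(wal_present: bool, checkpoint_present: bool) -> Result<Recovery, BootError> {
    match (wal_present, checkpoint_present) {
        (false, _) => Err(BootError::MissingWal),
        (true, false) => Ok(Recovery::ReplayWalFromStart),
        (true, true) => Ok(Recovery::RestoreCheckpointThenReplay),
    }
}
```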