polarsignals / frostdb

❄️ Coolest database around 🧊 Embeddable column database written in Go.
Apache License 2.0

*: move block admin events out of the WAL #916

Open asubiotto opened 1 month ago

asubiotto commented 1 month ago

Currently, we record rotations and block creations in the WAL in order to determine which writes to discard because they have been persisted, which writes were included in a failed persistence attempt that needs to be retried, and which writes are still in the active memory block.

Most of the time this works correctly, but deterministic simulation testing (DST) has shown that, because these records are written asynchronously, there are cases where the protocol breaks down and can even lead to data loss. Consider the following scenario:

If NewTableBlock were written synchronously in the above case, b1 would not be overwritten by b2 until b2's creation had been synchronously written to the WAL, where it could subsequently be observed during recovery. There could still be an issue if the snapshot truncates the WAL, since we could lose this "admin" event.

I attempted to write a commit that writes NewTableBlock synchronously, but it is not easy to shoehorn synchronous writes into a fundamentally asynchronous WAL, and DST found some more failure scenarios and deadlocks that do not give me confidence in this approach.

We have always thought that these types of admin events should move out of the WAL for performance reasons, and it is now clear that we should also do this for correctness reasons. The only bytes that should be written to the WAL are those related to writes.

On recovery, the "true" state of the database is currently reconstructed by observing the admin records and the order in which they occur. We should explore whether this "true" state can instead be reconstructed by simply recording the txn of the highest write in each persisted block. In that case, we could easily tell which transactions were actually persisted for a given table, and discard any in-memory writes with lower transactions. This seems like the best solution, since we would reconstruct the true state of the database based on which blocks actually show up as persisted.
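
To make the proposal concrete, here is a minimal sketch of what recovery could look like if each persisted block carried the txn of its highest write. None of these names exist in frostdb: `walRecord`, `persistedBlock`, `MaxTxn`, `highWatermarks` and `replayable` are hypothetical and only for illustration. The sketch also assumes that a block with highest txn N contains every write to that table with txn <= N, so a WAL write is discarded when its txn is at or below the table's persisted high watermark.

```go
package main

import "fmt"

// walRecord is a hypothetical WAL entry. Under the proposal, the WAL
// holds only writes, no rotation or block-creation ("admin") records.
type walRecord struct {
	Table string
	Txn   uint64
}

// persistedBlock is hypothetical metadata stored with a persisted block.
type persistedBlock struct {
	Table  string
	MaxTxn uint64 // txn of the highest write contained in the block
}

// highWatermarks returns, per table, the highest txn found in any
// block that actually shows up as persisted.
func highWatermarks(blocks []persistedBlock) map[string]uint64 {
	hw := make(map[string]uint64)
	for _, b := range blocks {
		if b.MaxTxn > hw[b.Table] {
			hw[b.Table] = b.MaxTxn
		}
	}
	return hw
}

// replayable keeps only the WAL writes that are not covered by a
// persisted block. Assumption: every write with txn <= the table's
// high watermark is contained in some persisted block.
func replayable(records []walRecord, hw map[string]uint64) []walRecord {
	var out []walRecord
	for _, r := range records {
		if persisted, ok := hw[r.Table]; ok && r.Txn <= persisted {
			continue // already durable in a persisted block
		}
		out = append(out, r)
	}
	return out
}

func main() {
	blocks := []persistedBlock{{Table: "a", MaxTxn: 5}}
	wal := []walRecord{{"a", 3}, {"a", 5}, {"a", 6}, {"b", 1}}

	for _, r := range replayable(wal, highWatermarks(blocks)) {
		fmt.Printf("replay table=%s txn=%d\n", r.Table, r.Txn)
	}
}
```

With this shape, a failed persistence attempt needs no special record: the block never shows up as persisted, so its writes stay above the watermark and are replayed from the WAL.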