polarsignals / frostdb

❄️ Coolest database around 🧊 Embeddable column database written in Go.
Apache License 2.0
1.25k stars 65 forks source link

Transaction rearchitecture #872

Open thorfour opened 1 month ago

thorfour commented 1 month ago

This issue is meant to describe how transactions work in FrostDB today, and how we would like them to work in the future. It is meant to act as a way to discuss implementation details for how we might re-architect the code to take what we have today and move it towards what we ideally want for transactions.

Today

Transactions are just a logical incrementing number that indicates a rough ordering of when writes came into the system. Today we do not offer a way to have a multi-write transaction; as such there is no possibility for write tearing. What that means is to the end user transactions are effectively meaningless. Our transactions also don't guarantee read snapshot isolation. Meaning that if you begin a read it is possible to have writes that technically newer than when you logically started your read from being returned in that read (but remember that they will be consistent writes returned not partial writes).

This non-isolation can happen in the following way:

  1. Read starts tx = 5
  2. Write starts and completes at tx = 6
  3. Previous write triggers a compaction
  4. Compaction moves the tx=6 into L1 (compaction destroys transaction information)
  5. Read from 1. continues and will read the data from the tx=6 write from L1 since it can't know what tx it's from

Today we also use our tx system to order administration events in the system via the write ahead log (WAL). Some of these events include block rotations to persistent storage, and snapshots taking place. Compactions currently are not recorded in the WAL.

A large amount of the above were not active design decisions, but were things that have come out of the tumultuous nature of the active development of FrostDB. Many of these decisions made sense at the time but as the code has evolved they may no longer serve the original purpose.

Hopes and Dreams

What we'd like to see out of our transaction system is:

thorfour commented 1 month ago

Idea:

Creating a database level pendingReaders waitgroup. We have table level waitgroups however this is unsuitable for ensuring snapshots aren't executing.

This would allow all reads to first bump the pending readers wg, and then to get a read transaction.

So what would happen is compaction would get the high watermark. Then it would wait for the database level readers wg, thus ensuring that active reads finish, and any reads about to start would get a watermark >= the compaction tx. Compaction could then safely continue because if a snapshot is happening it would either have happened before the compaction by bumping the wg and forcing the compaction to wait, or it would start after compaction has started, and then it would have a guaranteed tx that is >= to the compactions tx thus ensuring that no transactions higher than the snapshot tx make it into the snapshot.

Once we have that I think we can already get rid of the other WAL entries. Replay would simply always load the last snapshot, and then replay from the WAL up to that snapshot's tx.

@asubiotto wdyt?

asubiotto commented 1 month ago

or it would start after compaction has started, and then it would have a guaranteed tx that is >= to the compactions tx thus ensuring that no transactions higher than the snapshot tx make it into the snapshot.

So we would retain the db.Wait to ensure the compaction finishes before the snapshot starts?

Replay would simply always load the last snapshot, and then replay from the WAL up to that snapshot's tx.

s/up to/starting from?

This idea seems good to me but I'm not sure I would modify the pseudo-locking (snapshots grabbing a write txn and waiting) we just merged with this approach. Ideally we would have some way for compactions and snapshots to not have to block on each other for correctness.

By the way, I think technically we do offer snapshot isolation, just not according to our txn ids. A read might see a write that "happened after" in real time, but given that our transactions are single-operation I think that technically is snapshot isolation (maybe even serializable?) since we can just reorder the read in logical time.

Maybe the whole rearchitecture starts with removing some of the meaning we've given to txn ids and think about how we'd ideally like to design multi-op transactions. BTW there's a recent blog post about this by phil eaton which is a nice read: https://notes.eatonphil.com/2024-05-16-mvcc.html

thorfour commented 1 month ago

Ideally we would have some way for compactions and snapshots to not have to block on each other for correctness.

Yes but also I don't know if that's logically possible with compactions destroying transaction information.