Transactional MVCC Pagecache

spacejam commented 6 years ago

To fully support high-level transactions that can execute consistent scans in addition to reads and writes, we need to be able to materialize phantom conflicts. (background on phantoms included at the bottom of this description)

Requirements:

detect phantoms by including the index leaves involved in transactions in the conflict set
only force conflicts for scan-like read dependencies, not for reads of different specific keys that hit the same index leaf
impose a minimal performance cost on non-transactional workloads
rewriting / paging-out a page in the cache should not conflict with high-level transactions
conflicts don't need to be materialized for lock-free index node operations above the leaf layer
index leaf node splits and merges should cause conflicts on scan-like workloads

Implementation Sketch:

create generic transactor that follows the cicada strategy of dynamic pessimism, do more work to force early conflict detection when contention is detected via a tiny thread-local contention tracker that gets incremented up to 5 when conflicts happen, and decremented down to 0 using saturating arithmetic when operations are successful.
the generic transactor reasons about leaf index node id's
index access is proxied through the transactor

The simplest implementation would only use node id's in transaction read and write sets. This results in cheap scans, because they only include node id's in the transaction's read set. It causes unnecessary conflicts between point reads and writes (and writes with other writes) that involve different keys but are colocated on the same node. But in low-contention situations it really reduces overhead a lot.

The transactor can work on a set of shards far smaller than the pagetable.

items

[x] pagecache txid generator
[x] pagecache txid recovery from writes
[x] writebatch support
[x] atomic writebatch in crash test, setting all keys to the same value
[ ] epoch-based low-water mark handling: from defer, ensure low-water-mark >= txid
[x] page -> multiple versions, atomic read tx
[ ] cicada-like transaction protocol that is aware of contention, but avoiding distributed txid generation because we're nowhere close to 100m transactions per second
[ ] page transaction API
[ ] page consolidation cleans up stale versions below low-water-mark
[x] sled transaction API

Background on Phantoms:

tx 5 begins
the thread executing tx 5 is descheduled by the OS
tx 6 begins
tx 6 reads a range of keys
tx 5 resumes, and writes a new key into the middle of the range that tx 6 read
either tx 5 or tx 6 must abort, because for the transaction order to be the serial order of execution that we are reasoning about, all writes below tx 6 need to have been properly read, but the write from tx 5 was not. If tx 5 aborts, then tx 6 read all "previously" committed writes. If tx 6 aborts, then the new key written by 5 is never skipped by a later transaction. If both commit, we have violated serializability.

SaulDoesCode commented 5 years ago

All I want for Xmas are transactions, how far off are they?

Sorry if this is inappropriate. I'm very impressed by sled it's just fantastic and I want to incorporate it into my app as an embedded replacement for some of the things Redis and Mongo did.

The only thing blocking at the moment is having cancelable ACID transactions. I've been using the flush function a lot, but, it's not enough, especially when a set of mutations are incomplete and you want to cancel the whole batch when one fails to validate or load from the network. It's a bit of a nightmare.

Do you have any ETA on transactions?

Great work by the way, thank you for your contributions to open source. Keep it up, sled is awesome.

spacejam commented 5 years ago

@SaulDoesCode not inappropriate at all! And thank you for the kind words :D

Transactions are a very high priority right now. Here's the status of the different pieces that need to come together at various stages to complete the picture:

Protocol

After a tour of the literature (#232) I've started working toward designs that are similar to Cicada. I've written an STM library that is testing out some of these ideas, and more importantly, showing me how to build them in a testable way. The implementation of this is done, and I feel pretty comfortable with the safety of the implementation. I'll release this this week and write up a little blog post about it, since it ties into the "fearless concurrency" goals of the rust ecosystem in general.

API

Within a transaction, there would ideally be a way to access an API that is identical to the current Tree API, but having it be transactional. This gets a little interesting when combining transactional semantics with merge operators, and we need to pay close attention to the literature to avoid phantom anomalies (having one tx scan while another transaction adds a new item in the middle of the scan that was missed, etc...) but a matching API is the target. I may pull out the API into a common trait that will later also be applicable to distributed / replicated systems, but that's a future consideration.

Crash Safety

This part is pretty straightforward. We basically need to make sure that transactions are atomically entirely applied or rolled back during recovery. As long as our transaction protocol controls access to items and allows us to mark transactions as pending -> {aborted, committed} in a recoverable way.

The naive version of this just boils down to writing a special key that describes in-progress transactions, and during recovery we scan over these in order of transaction ID, ensuring all aborted transactions have their pending state removed, cleaning up intermediate state for committed transactions, and removing all transactions at and after the first pending transaction encountered during in-order recovery. We'll use something like epoch-based reclamation to track the low-water mark for transactional state that we can clean up.

Testing

Part of building the STM library loosely based on Cicada is to learn how to build a transactional system that is testable, ideally as exhaustively as possible. Right now I'm porting over some tests that are similar to hermitage to ensure that we're preventing unserializable behavior.

Right now the implementation uses a mix of probabilistic (random sleep) concurrent interleaving tests, as well as a few exhaustive scheduled tests that work over literally every single possible interleaving for a subset of functionality at a specific number of threads. Exhaustive testing very quickly hits a wall where it would take 10 years to complete a test, but this has been used to successfully locate a few bugs during development.

High-Level transactional DB

The first thing to be implemented will be the transactional DB. This will basically be a new thing just built on top of the sled Tree but having the same API. There are no architectural blockers for this, now that we have a monotonic ID generator and the ability to reverse iterate. Maybe this will be out by xmas :] The main "interesting" parts of this are representing merges and materializing conflicts for scans, but they are only slightly interesting and I don't see them as being blockers.

Low-Level Transactional Pagecache

This issue is about a transactional pagecache, which applies transactional ideas to a lower level within sled. This involves a lot more architectural modification, and to be honest I'm not sure if it will ever be complete, especially if the transactional DB solves all needs, and we don't have any big users requesting a transactional pagecache. The main blockers for this involve a pretty major overhaul of how state is modified throughout the entire system. Some of these blockers are necessary for long-term efficient disk and memory usage anyway, but others may actually impose a performance hit unless I can figure out how to avoid it carefully. Basically, lots of complexity that also kind of goes against the "pay only for what you use" philosophy in Rust.

spacejam commented 5 years ago

After months of thought around this, a few opportunities (constraints) have popped up:

we want to build more than just b-trees on top of the pagecache: relational tables, hash tables, merkle trees, geospatial and metric structures, graphs, queues, logs, etc...
we want to stabilize the low-level on-disk format sooner than later so that we can make sled beta and releasable at 1.0 sooner
we want to be able to run transactions across different kinds of high-level structures in the same transaction
we want to support an HTAP mix of workloads (some long-running queries mixed with low-latency point queries)

Given these rather high goals, it's clear that an MVCC pagecache is probably the best way forward, and to invest effort on the low-level components first, before a high-level transactional API.

tldr; this ticket is being prioritized as a way to reach file format forward compatibility ASAP

spacejam commented 5 years ago

Initial sketches of versioning are underway in the tyler_versions branch. Basically, I've wanted to falsify a few competing designs first to make sure that the complexity of versioning pushed to the pagecache layer is worthwhile.

implementation ideas for the next round:

log-pegging (needs #430 to not deadlock with huge transactions)
move cicada-style versioning in-memory, accessible on the pagetable entry for a page
transparently make pagecache keys refer to read_wts
transparently make sled_sync::Guard's generate a transactional timestamp in addition to the crossbeam_epoch Guard
when fixing-up pages in-memory during reads, only page-out old MergedResident's below the low-water mark
each node in the pagecache chain has its wts
we can have multiple MergedResidents, but we need to keep track of which ones were replacements under the hood

protocol:

pin Guard, generate timestamp
perform reads, getting the read_wts of reads from the pagecache
cache writes locally until commit time
at commit time, stage prospective frags in the cicada version chain in-memory
validate
unstage or log + commit

nice things about this:

no need to write more than 15 bytes (log message header) to get atomicity of multiple writes, due to pegging. 0 storage overhead for updates with just one item, which is the majority. zero changes to recovery / page accounting logic.
no change in pagecache tenant (tree) code for existing workloads.
makes tx submission a Vec<(page_id_in_writeset, read_wts, new_frag)>, Vec<page_id_in_readset>

spacejam commented 5 years ago

~~pegging can also be implemented by just delaying the completion of the very first reservation made in the batch, so it's actually 0 overhead~~

it's a little trickier than that, since recovery must also be atomic. the solution that is currently implemented basically takes out a reservation before doing any writes in a batch, then loads all of the batch writes into io buffers, then gets the current reserved max lsn and writes that into the first reserved slot. during recovery, if we hit that peg, it will contain the tip that must also be present if we are going to continue recovery. this keeps recovery overhead very low.

there is just 1 component necessary before writebatches now:

concurrency control where we prevent writes on the same keys from being interleaved in the middle of a writebatch. this can be implemented in a naive way with just a big vector of mutexes that get taken out for each page involved in a transaction

this can be implemented completely in-memory and can happen either just in sled or as full pagecache transactions. just having a sharded mutex on sled for now is fast to implement.

spacejam commented 5 years ago

Transactions are now in 0.26! They are implemented in an incredibly naive and low-performance way, and eventually may rely on this MVCC pagecache to become fully-featured.

Keeping this issue open, because this is possibly the endgame for transactions, but for now people can use a transactional interface for the correctness aspects!

Firstyear commented 5 years ago

Hey there! Great work on enabling transactions - I'm looking forward to reading the source later.

I read the documentation and noticed the theme seems to be focused on transactional trees such as https://docs.rs/sled/0.26.0/sled/struct.Tree.html#method.transaction. I can see a lot of really valid use cases for this in many applications.

Saying this I can also see how a larger-single db transaction could be useful. For example, I'm working on a no-sql style database (not currently using sled, but I want to convert!), and indexes for attributes would cause me to need to do something like:

(&id2entry, &index_name, &index_uid, &index_displayname, ....).transaction(|(...)| {
}

So it could become very unwieldy, really quickly! To compound matters, users can define indexes to attributes and tree names "I don't know of" yet. So I can't really hard code them into variables, which would make transaction initialisation tricky.

I think what would be really useful for an interface like this could be a db-wide transaction, which also makes tree creation and removal transactional. Some hypothetical code could be:

let db = Db::start(config).unwrap();

db.transaction(|db| {
    let index_name = db.open_tree("index_name"); // maybe this is created now?
    ... // do work    
    let id2entry = db.open_tree("id2entry");
    ... // do work    

    // perhaps we removed an attribute from indexing?
    db.drop_tree("index_removed");
})

This way tree removal is part of the transaction, as is creation. For example, in my system if someone was to remove an attribute from indexing in the nosql layer, then we want to remove the relevant disk-backed indexing, but only if the transaction as a whole completes - and of course, respecting other transactions that may be read-only and exist that are still reading from the tree we are about to remove.

Finally, it would be great to have a separation between rw-intent transactions and ro-transactions, which is an important distinction in mvcc as they would take different strategies for page management. It would also allow applications to rely on sled as the concurrency management layer, rather than requiring their own locking or use of cowcell or other primitives.

What do you think about these ideas? I'd really like to help contribute to these in some way if possible too, but as I'm sure you know, time is always the enemy of good things :)

Thanks again for your amazing work!

spacejam commented 5 years ago

Hey @Firstyear, all of those possibilities around tree-level operations become possible with transactional concerns being pushed into the pagecache!

Applications can already rely on sled for their concurrency management layer, as all operations with externally-visible effects are fully atomic and threadsafe.

sled does not use a traditional fixed-page size B+Tree, or old techniques for concurrency control. This issue is about applying cicada on a Bw-Tree, and combining the LLAMA page fragment and cicada version chains into a single view over pages. Read-only and writing transactions follow the same path in this approach.

While fine-grained optimistic concurrency control is likely to happen soon and follow the existing API, going beyond this in any sort of defined timescale can't happen without a more sustainable funding situation for the project than my nights and weekends.

Firstyear commented 5 years ago

Thanks for the detailed answer - what would be the next step then to enable the "pseudo" code that I provided? Open a ticket I think with the design and ideas and then start to dig into it from there?

EDIT: To clarify, I meant for myself to research into how to enable this rather than yourself. I understand time is a limited resource, and I hope that soon I can donate some of mine.

spacejam commented 5 years ago

@Firstyear for the code you posted to work, with open_tree being transactional, everything the pagecache does needs to change from a lock-free approach to a batched transaction approach, and every high-level tree operation needs to end up in a transactional operation structure that records the read timestamps for all pages as they are accumulated. Then, before any writes are published, a concurrency control system must be used to validate the transaction before publishing it in the pagecache. It's a fairly extensive architectural change that will touch almost everything, and it's not easy to do as an isolated task, unfortunately.

tomtau commented 5 years ago

Just a question -- the current transaction API (0.26) works on tuples of trees. Is there a simple way to make it work on vectors of trees?

spacejam commented 5 years ago

@tomtau While this would be simple right now, if trees are parameterized by deserialized value types, it would no longer work. I'm intentionally keeping it totally type agnostic via tuples for now to make future development easier.

asonix commented 5 years ago

While I know this isn't related to this thread per-se, if you're planning to enable Tree<Value>, are you thinking of doing this with Serde? Currently it's trivial to do something like struct Tree<V>(sled::Tree, PhantomData<V>); and wrap the inner tree operations with serde conversions.

ordian commented 5 years ago

@tomtau While this would be simple right now, if trees are parameterized by deserialized value types, it would no longer work. I'm intentionally keeping it totally type agnostic via tuples for now to make future development easier.

Is there any reason why Transactional can't be implemented both for type agnostic tuples and vectors of the same type (Vec<Tree<Value>>)?

spacejam commented 5 years ago

@ordian that should be pretty easy, as long as all trees are the same type.

spacejam / sled