MVCC and transactional design shootout :gun: :bomb: :explosionbomb: :extrabigexplosion:

spacejam commented 6 years ago

Time for a literature shootout to determine the initial architecture for sled 0.16's transactions and snapshots! If anyone has particular insights or opinions on these, please jump in here!

There are a number of separate concerns when MVCC is mentioned, that we should be careful not to conflate (taken from the paper "An Empirical Evaluation of In-Memory Multi-Version Concurrency Control" below):

concurrency control protocols - this is the primary thing under consideration in this issue!
version storage
- append-only storage - all versions exist in the same storage space. lock free linked list points either from oldest to newest (O2N) which suffers from cache pollution due to having to scan through old versions always, and newest to oldest (N2O) which requires an indirection table (which we have anyway). off-page storage of BLOBs can be complicated, may need reference counts.
- time-travel storage - other versions are stored in a separate table. some systems store the newest version in the master table, others store the oldest there to facilitate easy SI (SAP HANA does this). For our trade-offs, we may be fine keeping old versions in a separate, memory-only space after a transaction completes and invalidates them, and then we could use an epoch system to handle GC. This does not scale for long snapshots, but this is acceptable for the chosen tradeoffs of this system. time-travel storage can suffer from the same issues with off-page storage as append-only storage.
- delta-storage - used by mysql, oracle, and HyPer. just store the main version in storage, and then store deltas as later versions. can incur read performance hits.
- general considerations: balancing write contention and cache pollution are the major trade-offs here
GC
- vaccuum (doesn't scale but is simple)
- cooperative cleaning - when a page is read if we notice it has stale versions, tell a global GC mechanism about it, vulnerable to "dusty corners" so hekaton does this but also occasionally vacuums. only compatible with append-only storage (so we're good). requires that the mechanism that potentially cleans up data reads the oldest entries. we can implement this by having relocation calls fully read chains, while reads only access the new part of chains that they require to keep cache pollution down. right now sled reads the entire page always, which pollutes cache. this is something to flag for later.
- transaction-level gc, track read+write sets of each transaction per epoch, when epochs end, we can clear all state generated by transactions in that epoch. simple but requires tracking the epoch mapping
index management of secondary indices (2i). we are NOT going to be implementing 2i in 0.16, but we should be mindful not to make the path toward implementing them in the future needlessly complex
- logical pointer to primary key - grows in size with primary key length
- logical pointer to tuple ID - use a separate lock-free lookup table, just store 64 bit tuple ID
- physical pointer to storage. only usable with append-only storage, so we're good. used by hekaton, memsql. when we update the current version in storage, we update all 2i's to point to the new physical location.

The ideas in silo and tictoc particularly appeal to me due to the focus on reducing scalability barriers at high core counts, where we'll be headed in the coming years. A concern that will have a major impact on the implementation is how snapshots are handled. Sled is optimized for point reads and short-medium scans. It is acceptable for long scans to pay a higher GC penalty, especially if it simplifies implementation complexity. We value reliable worst case latency far more than crazy p0th-p50.

High-Performance Concurrency Control Mechanisms for Main-Memory Databases (2012)

Hekaton MVCC paper. They went with the optimistic choice here.

Speedy Transactions in Multicore In-Memory Databases (2013) (silo)

interesting techniques for supporting many-core, decentralized systems

epoch-based

it avoids all centralized contention points, including that of
centralized transaction ID assignment. Silo’s key contri-
bution is a commit protocol based on optimistic concur-
rency  control  that  provides  serializability  while  avoid-
ing all shared-memory writes for records that were only
read.  Though  this  might  seem  to  complicate  the  en-
forcement of a serial order, correct logging and recov-
ery is provided by linking periodically-updated epochs
with the commit protocol. Silo provides the same guar-
antees as any serializable database without unnecessary
scalability bottlenecks or much additional latency.

Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems (2015)

keep old versions in undo buffers

We present a novel MVCC implementation for main-mem-
ory database systems that has very little overhead compared
to serial execution with single-version concurrency control,
even when maintaining serializability guarantees.  Updating
data in-place and storing versions as before-image deltas in
undo buffers not only allows us to retain the high scan per-
formance  of  single-version  systems  but  also  forms  the  ba-
sis  of  our  cheap  and  fine-grained  serializability  validation
mechanism.   The  novel  idea  is  based  on  an  adaptation  of
precision  locking  and  verifies  that  the  (extensional)  writes
of recently committed transactions do not intersect with the
(intensional) read predicate space of a committing transac-
tion.  We experimentally show that our MVCC model allows
very  fast processing of  transactions with point accesses
as well as read-heavy transactions and that there is little need
to prefer SI over full serializability any longer.

High performance transactions via early write visibility (2017)

restrict transaction abort conditions to avoid delayed write visibility delayed write visibility stems from the fact that database systems can arbitrarily abort transactions at any point during their execution. Accordingly, we make the case for database systems which only abort transactions under a restricted set of conditions, thereby enabling a new recoverability mechanism, early write visibility, which safely makes transactions' writes visible prior to the end of their execution. We design a new serializable concurrency control protocol, piece-wise visibility (PWV), with the explicit goal of enabling early write visibility. We evaluate PWV against state-of-the-art serializable protocols and a highly optimized implementation of read committed, and find that PWV can outperform serializable protocols by an order of magnitude and read committed by 3X on high contention workloads.

Efficiently making (almost) any concurrency control mechanism serializable (2017)

certifier approach

We propose the serial safety net (SSN), a
serializability-enforcing certifier which can be applied on top of var-
ious CC schemes that offer higher performance but admit anomalies,
such as snapshot isolation and read committed. The underlying CC
mechanism retains control of scheduling and transactional accesses,
while SSN tracks the resulting dependencies. At commit time, SSN
performs a validation test by examining only direct dependencies
of the committing transaction to determine whether it can commit
safely or must abort to avoid a potential dependency cycle.

An Empirical Evaluation of In-Memory Multi-Version Concurrency Control (2017)

a survey of trade-offs, mentions a few of the previous contestants here

we conduct an extensive study of the
scheme’s four key design decisions: concurrency control protocol,
version storage, garbage collection, and index management.  We
implemented state-of-the-art variants of all of these in an in-memory
DBMS and evaluated them using OLTP workloads.  Our analysis
identifies the fundamental bottlenecks of each design choice

Transaction Healing: Scaling Optimistic Concurrency Control on Multicores (2016)

we propose a new concurrency-control mechanism, 
called transaction healing, that exploits program
semantics to scale the conventional OCC towards 
dozens of cores even under highly contended workloads.
Transaction healing captures the dependencies across 
operations within a transaction prior to its execution. 
Instead of blindly rejecting a transaction once its validation 
fails, the proposed mechanism judiciously restores any
non-serializable operation and heals inconsistent transaction 
states as well as query results according to the extracted 
dependencies. Transaction healing can partially update the 
membership of read/write sets when processing dependent 
transactions. Such overhead, however, is largely reduced by 
carefully avoiding false aborts and rearranging validation 
orders. We implemented the idea of transaction healing in 
TheDB, a main-memory database prototype that provides 
full ACID guarantee with a scalable commit protocol. By 
evaluating TheDB on a 48-core machine with two widely-used 
benchmarks, we confirm that transaction healing can scale 
near-linearly, yielding significantly higher transaction rate than 
the state-of-the-art OCC implementations.

TicToc: Time Traveling Optimistic Concurrency Control (2016)

avoid centralized timestamp ordering

Previous research has shown that timestamp management is the key
scalability bottleneck in concurrency control algorithms. This pre-
vents the system from scaling to large numbers of cores.
In this paper we present TicToc, a new optimistic concurrency
control algorithm that avoids the scalability and concurrency bot-
tlenecks of prior T/O schemes. TicToc relies on a novel and prov-
ably correct data-driven timestamp management protocol. Instead
of assigning timestamps to transactions, this protocol assigns read
and write timestamps to data items and uses them to lazily com-
pute a valid commit timestamp for each transaction. TicToc re-
moves the need for centralized timestamp allocation, and commits
transactions that would be aborted by conventional T/O schemes.

spacejam commented 6 years ago

Contention-Aware Lock Scheduling for Transactional Databases

innodb recently added CATS (originally VATS) support, for scheduling lock acquisition based on a dependency graph and picking transactions with the most downstream dependencies.

not sure how this approach benefits a system targeted from day-1 for modern architectures, but it seems to be a very powerful incremental improvement for the innodb engine

we present the concept of contention-aware scheduling,
show the hardness of the problem, and propose novel lock
scheduling algorithms (LDSF and bLDSF), which guarantee
a constant factor approximation of the best scheduling.  We
conduct extensive experiments using a popular database on
both TPC-C and a microbenchmark.  Compared to FIFO—
the default scheduler in most database systems—our bLDSF
algorithm yields up to 300x speedup in overall transaction
latency.   On  the  other  hand,  our  LDSF  algorithm,  which
is simpler and achieves comparable performance to bLDSF,
has already been adopted by open-source community,  and
chosen as default scheduling strategy in MySQL 8.0.3+.

spacejam commented 6 years ago

This nice slidedeck from Andy Pavlo covers the high-level implementations of hekaton, hyper, and cicada.

spacejam commented 6 years ago

Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores (2014)

o better understand just how unprepared current DBMSs are for
future CPU architectures, we performed an evaluation of concur-
rency control for on-line transaction processing (OLTP) workloads
on many-core chips.  We implemented seven concurrency control
algorithms on a main-memory DBMS and using computer simula-
tions scaled our system to 1024 cores.  Our analysis shows that all
algorithms fail to scale to this magnitude but for different reasons.
In each case, we identify fundamental bottlenecks that are indepen-
dent of the particular database implementation and argue that even
state-of-the-art DBMSs suffer from these limitations. We conclude
that  rather  than  pursuing  incremental  solutions,  many-core  chips
may  require  a  completely  redesigned  DBMS  architecture  that  is
built from ground up and is tightly coupled with the hardware.

Another cool Pavlo deck about the work from "staring into the abyss"

spacejam commented 6 years ago

Cicada: Dependably Fast Multi-Core In-Memory Transactions

Cicada is a single-node multi-core in-memory transactional data-
base with serializability. To provide high performance under diverse
workloads, Cicada reduces overhead and contention at several levels
of the system by leveraging optimistic and multi-version concur-
rency control schemes and multiple loosely synchronized clocks
while mitigating their drawbacks. On the TPC-C and YCSB bench-
marks, Cicada outperforms Silo, TicToc, FOEDUS, MOCC, two-
phase locking, Hekaton, and ERMIA in most scenarios, achieving
up to 3X higher throughput than the next fastest design.

Seems like the current contender.

spacejam commented 6 years ago

I'm aiming toward a cicada-like architecture, but with simplified timestamp generation initially just using an atomic fetch_add. We can also support causal consistency with zero centralized timestamp contention by just using a higher timestamp than what we find in our initial reads and tracking a thread-local max timestamp encountered, plus a per-thread identifier in the low bits for guaranteeing timestamp uniqueness. We have a LOT of tuning to go before this becomes a bottleneck.

spacejam / sled

MVCC and transactional design shootout :gun: :bomb: :explosionbomb: :extrabigexplosion: #232