steveyen / gkvlite

Simple, ordered, key-value persistence library for the Go Language
MIT License
264 stars 27 forks source link

gkvlite

gkvlite is a simple, ordered, ACID, key-value persistence library for Go.

GoDoc Build Status Coverage Status

Overview

gkvlite is a library that provides a simple key-value persistence store, inspired by SQLite and CouchDB/Couchstore.

gkvlite has the following features...

Key concepts

ACID properties

Performance

Snapshots

Concurrency

The simplest way to use gkvlite is in single-threaded fashion, such as by using a go channel or other application-provided locking to serialize access to a Store.

More advanced users may want to use gkvlite's support for concurrent goroutines. The idea is that multiple read-only goroutines, a single read-write (mutation) goroutine, and a single persistence (flusher) goroutine do not need to block each other.

More specifically, you should have only a single read-write (or mutation) goroutine per Store, and should have only a single persistence goroutine per Store (doing Flush()'s). And, you may have multiple, concurrent read-only goroutines per Store (doing read-only operations like Get()'s, Visit()'s, Snapshot()'s, CopyTo()'s, etc).

A read-only goroutine that performs a long or slow read operation, like a Get() that must retrieve from disk or a range scan, will see a consistent, isolated view of the collection. That is, mutations that happened after the slow read started will not be seen by the reader.

IMPORTANT: In concurrent usage, the user must provide a StoreFile implementation that is concurrent safe.

Note that os.File is not a concurrent safe implementation of the StoreFile interface. You will need to provide your own implementation of the StoreFile interface, such as by using a channel to serialize StoreFile method invocations.

Finally, advanced users may also use a read-write (mutation) goroutine per Collection instead of per Store. There should only be, though, only a single persistence (flusher) goroutine per Store.

Other features

LICENSE

Open source - MIT licensed.

Examples

import (
    "os"
    "github.com/steveyen/gkvlite"
)

f, err := os.Create("/tmp/test.gkvlite")
s, err := gkvlite.NewStore(f)
c := s.SetCollection("cars", nil)

// You can also retrieve the collection, where c == cc.
cc := s.GetCollection("cars")

// Insert values.
c.Set([]byte("tesla"), []byte("$$$"))
c.Set([]byte("mercedes"), []byte("$$"))
c.Set([]byte("bmw"), []byte("$"))

// Replace values.
c.Set([]byte("tesla"), []byte("$$$$"))

// Retrieve values.
mercedesPrice, err := c.Get([]byte("mercedes"))

// One of the most priceless vehicles is not in the collection.
thisIsNil, err := c.Get([]byte("the-apollo-15-moon-buggy"))

// Iterate through items.
c.VisitItemsAscend([]byte("ford"), true, func(i *gkvlite.Item) bool {
    // This visitor callback will be invoked with every item
    // with key "ford" and onwards, in key-sorted order.
    // So: "mercedes", "tesla" are visited, in that ascending order,
    // but not "bmw".
    // If we want to stop visiting, return false;
    // otherwise return true to keep visiting.
    return true
})

// Let's get a snapshot.
snap := s.Snapshot()
snapCars := snap.GetCollection("cars")

// The snapshot won't see modifications against the original Store.
c.Delete([]byte("mercedes"))
mercedesIsNil, err := c.Get([]byte("mercedes"))
mercedesPriceFromSnapshot, err := snapCars.Get([]byte("mercedes"))

// Persist all the changes to disk.
s.Flush()

f.Sync() // Some applications may also want to fsync the underlying file.

// Now, other file readers can see the data, too.
f2, err := os.Open("/tmp/test.gkvlite")
s2, err := gkvlite.NewStore(f2)
c2 := s.GetCollection("cars")

bmwPrice, err := c2.Get([]byte("bmw"))

Tips

Because all collections are persisted atomically when you flush a store to disk, you can implement consistent secondary indexes by maintaining additional collections per store. For example, a "users" collection can hold a JSON document per user, keyed by userId. Another "userEmails" collection can be used like a secondary index, keyed by "emailAddress:userId", with empty values (e.g., []byte{}).

Bulk inserts or batched mutations are roughly supported in gkvlite where your application should only occasionally invoke Flush() after N mutations or after a given amount of time, as opposed to invoking a Flush() after every Set/Delete().

Implementation / design

The fundamental data structure is an immutable treap (tree + heap).

When used with random heap item priorities, treaps have probabilistic balanced tree behavior with the usual O(log N) performance bounds expected of balanced binary trees.

The persistence design is append-only, using ideas from Apache CouchDB and Couchstore / Couchbase, providing a simple approach to reaching ACID properties in the face of process or machine crashes. On re-opening a file, the implementation scans the file backwards looking for the last good root record and logically "truncates" the file at that point. New mutations are appended from that last good root location. This follows the MVCC (multi-version concurrency control) and "the log is the database" approach of CouchDB / Couchstore / Couchbase.

TRADEOFF: the append-only persistence design means file sizes will grow until there's a compaction. To get a compacted file, use CopyTo() with a high "flushEvery" argument.

The append-only file format allows the FlushRevert() API (undo the changes on a file) to have a simple implementation of scanning backwards in the file for the next-to-last good root record and physically truncating the file at that point.

TRADEOFF: the append-only design means it's possible for an advanced adversary to corrupt a gkvlite file by cleverly storing the bytes of a valid gkvlite root record as a value; however, they would need to know the size of the containing gkvlite database file in order to compute a valid gkvlite root record and be able to force a process or machine crash after their fake root record is written but before the next good root record is written/sync'ed.

The immutable, copy-on-write treap plus the append-only persistence design allows for fast and efficient MVCC snapshots.

TRADEOFF: the immutable, copy-on-write design means more memory garbage may be created than other designs, meaning more work for the garbage collector (GC).

TODO / ideas