MerkleDB
MerkleDB is a Clojure library for storing and accessing large data sets in a
hybrid column-oriented tree of content-addressable data blocks.
This project is usable, but should be considered alpha quality. For more
details, see the design doc, proposed client
interface, and sample usage patterns.
Installation
Library releases are published on Clojars. To use the latest version with
Leiningen, add the following dependency to your project definition:
This will pull in the omnibus package, which in turn depends on each subproject
of the same version. You may instead depend on the subprojects directly if you
wish to omit some functionality, such as Spark integration.
Concepts
The high-level semantics of this library are similar to a traditional key-value
data store:
- A database is a collection of tables, along with some user metadata.
- Tables are collections of records, which are identified uniquely within the
table by an id key.
- Each record is an associative collection of fields, mapping field names to
values.
- Values may have any type that the underlying serialization format supports.
There is no guarantee that all the values for a given field have the same
type.
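The model above can be pictured with ordinary Clojure data. This is only an illustrative sketch: the map layout, the `:id-key` and `:records` keys, and the `field-types` helper are assumptions made for this example, not MerkleDB's actual representation or API.

```clojure
;; Hypothetical illustration: plain Clojure maps, not MerkleDB's real API.
(def example-db
  {:metadata {:owner "analytics"}            ; user metadata on the database
   :tables
   {"users"
    {:id-key :user/id                        ; field that identifies each record
     :records
     {1 {:user/id 1, :name "Ada",   :age 36}
      2 {:user/id 2, :name "Grace", :age "unknown"}}}}}) ; same field, other type

(defn field-types
  "Returns the set of value classes seen for `field` across a table's records."
  [db table field]
  (->> (get-in db [:tables table :records])
       vals
       (keep #(get % field))
       (map class)
       set))
```

Calling `(field-types example-db "users" :age)` returns a set containing both `java.lang.Long` and `java.lang.String`, which shows why consumers cannot assume a single type per field.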
Goals
The primary design goals of MerkleDB are:
- Flexible schema-free key-value storage.
- Highly parallel reads and writes, optimized for bulk processing, where a
job computes over most or all of the records in the table but may need only
a subset of the fields in each record.
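The bulk-processing access pattern can be sketched with `clojure.core.reducers`, which folds over a vector in parallel once it is large enough. This is a minimal in-memory sketch, not how MerkleDB reads data: the `records` vector and the `sum-field` and `mean-field` helpers are hypothetical names introduced here.

```clojure
(require '[clojure.core.reducers :as r])

;; Hypothetical sample data; a real table would be far larger.
(def records
  [{:id 1, :name "Ada",    :age 30, :bio "long text"}
   {:id 2, :name "Grace",  :age 40, :bio "long text"}
   {:id 3, :name "Edsger", :age 50, :bio "long text"}])

(defn sum-field
  "Folds over every record but reads only `field`; missing values count as 0.
  `records` should be a vector so that r/fold can split it across threads."
  [records field]
  (r/fold + (r/map #(get % field 0) records)))

(defn mean-field
  "Mean of `field` over all records, or nil for an empty collection."
  [records field]
  (when (seq records)
    (/ (sum-field records field) (count records))))
```

The job touches every record but only the `:age` field. In a column-oriented layout, the unused `:name` and `:bio` values would ideally never need to be read at all.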
Secondary goals include:
- Efficient storage utilization via deduplication and structural sharing.
- Lightweight versioning and copy-on-write to support immutable reads.
- Building on storage and synchronization abstractions to support hosted service
backends.
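Deduplication and copy-on-write both follow from content addressing: a block's identifier is a hash of its content, so identical blocks are stored once and existing blocks are never modified. The sketch below assumes a plain map as the block store and string block contents; `sha256-hex` and `put-block` are hypothetical helpers, not part of MerkleDB.

```clojure
(import 'java.security.MessageDigest)

(defn sha256-hex
  "Returns the SHA-256 digest of string `s` as lowercase hex."
  [^String s]
  (let [digest (.digest (MessageDigest/getInstance "SHA-256")
                        (.getBytes s "UTF-8"))]
    (apply str (map #(format "%02x" %) digest))))

(defn put-block
  "Stores `content` under the hash of its bytes.
  Returns [new-store id]; the store passed in is left unchanged."
  [store ^String content]
  (let [id (sha256-hex content)]
    [(assoc store id content) id]))
```

Storing `"block-a"` twice yields the same id and a single entry, and each call returns a new store value while earlier values remain valid snapshots for readers.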
Non-goals:
- High-frequency, highly concurrent writes. Initial versions will have simple
database-wide locking for updates.
- Access control. In this library, all authentication and authorization are
deferred to the storage layers backing the block store and ref manager.
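Database-wide locking for updates can be pictured as a single lock that serializes writers while readers work from immutable snapshots. This is a sketch of the general idea under assumed names (`db-lock`, `db-root`, `update-db!`), not the library's locking implementation.

```clojure
;; Hypothetical names throughout; not MerkleDB's API.
(defonce db-lock (Object.))

;; The current database root. Readers deref it without taking the lock and
;; get an immutable snapshot.
(def db-root (atom {:version 0, :tables {}}))

(defn update-db!
  "Applies `f` to the current root (plus `args`) while holding the
  database-wide lock, bumps the version, and publishes the new root."
  [f & args]
  (locking db-lock
    (let [new-root (-> (apply f @db-root args)
                       (update :version inc))]
      (reset! db-root new-root))))
```

For example, `(update-db! assoc-in [:tables "users"] {})` adds an empty table and increments `:version`. Only one update runs at a time, which keeps the design simple at the cost of write concurrency.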
License
This is free and unencumbered software released into the public domain.
See the UNLICENSE file for more information.