Djinn is an open-source single-header C++ library for efficiently storing and analyzing large collections of genotypes and/or haplotypes.
For reference, our proposed compression algorithms were tested and compared against each other and against the incumbent standard Vcf and Bcf formats using a disk-based benchmark. The host architecture used is a 10 nm Cannon Lake Core i3-8121U with gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5).
Method | File size (MB) | Comp. Time (s) | Ratio (VCF) | Ratio (BCF) | Decompress* (MB/s) | Inflate** (MB/s) | Output VCF (s) |
---|---|---|---|---|---|---|---|
djinn PBWT+EWAH+ZSTD-19 | 21.36 | 119.9 | 814.2 | 10.7 | 24119 | 1140 | 31.0 |
djinn PBWT+EWAH+LZ4-HC-9 | 26.48 | 93.9 | 656.7 | 8.6 | 25917 | 1143 | 28.2 |
djinn PBWT+CTX | 16.19 | 83.0 | 1074.1 | 14.2 | 3757 | 907 | 34.1 |
djinn EWAH+LZ4-HC-9 | 57.05 | 84.9 | 304.8 | 4.0 | 15241 | 2827 | 19.0 |
djinn EWAH+ZSTD-19 | 41.84 | 207.4 | 415.6 | 5.4 | 12853 | 2667 | 19.9 |
djinn CTX | 86.7 | 73.6 | 200.5 | 2.6 | 1118 | 858 | 33.2 |
BCF | 229.9 | NA | 80.5 | 1 | NA | NA | 148.3 |
VCF.gz | 281.34 | NA | 65.8 | 0.82 | NA | NA | 457.6 |
*Decompress: decompressing into the succinct EWAH data structure. **Inflate: decompress and inflate into byte arrays of length N (as in uBcf).
Djinn can offer strong compression with slower decompression speeds using statistical models (CTX). Reversely, the EWAH-based algorithms provide stronger decompression and query speeds with lower compression rates. The optimal algorithm dependends on its application context: whether query speeds or compression matters the most.
Example analysis of HRC-chr20 dataset. a) Output Djinn blocks results in resetting the PBWT permutation order to circumvent storing the permutation array. Each cycle (spike) corresponds to an indenpendent data block with 8192 variants for either with PBWT (green) or without (blue). There is a considerable difference in compressibility and query speed between the two approaches: PBWT-based approaches consume less disk space require more CPU time to query and PBWT-free models require more disk space but is vastly faster to query. b) Zoom in for the first five Djinn blocks in a). c) PBWT-free models can achieve hundreds of gigabytes per second inflate speeds (in Vcf-space) when operating in EWAH-space.
Building Djinn requires no additional packages. For full support, building requires zstd or lz4. If you intend on using the optional support class vcf_reader.h
for consuming htslib
-based files then htslib is required. Build with autotools:
./autogen.sh # Generate the header template
./configure # Generate Makefile
make # Build libraries and place in .libs directory
make install # Install headers and libraries
Djinn is designed as a programming library providing simple C++ APIs to build/load an archieve and to query it. The examples/
directory contains a variety of examples that demonstrates typical uses of the C++ APIs. The library requires only a single header: djinn.h
. This file contains additional detailed API documentation.
Djinn aims to keep the APIs for djinn_*
classes in djinn.h
stable.
Implemented algorithms:
A simple quick start guide to encoding:
#include <djinn/djinn.h>
// Initiate new model of interest.
djinn::djinn_model* djn_ctx = new djinn::djinn_ctx_model();
// Encode Bcf-encoded data
int ret = djn_ctx->EncodeBcf(data_in, data_len, ploidy, n_allele);
// Finish encoding and serialize to stdcout
djn_ctx->FinishEncoding();
int serial_size = djn_ctx->Serialize(std::cout);
delete djn_ctx
Changing the desired algorithm is easy: replace djinn::djinn_ctx_model()
with for example,
djinn::djinn_ewah_model(djinn::CompressionStrategy::LZ4, 9)
or djinn::djinn_ewah_model(djinn::CompressionStrategy::ZSTD, 21)
.
Decoding data is equally simple
#include <djinn/djinn.h>
// Initiate new model of interest.
djinn::djinn_model* djn_decode = new djinn::djinn_ctx_model();
// Deserialize model from cin
int decode_ctx_ret = djn_decode->Deserialize(std::cin);
// Primary return structure in Djinn
djinn::djinn_variant_t* variant = nullptr;
// Start decoding and retrieve variants iteratively
djn_ctx->StartDecoding();
for (int i = 0; i < djn_decode->n_variants; ++i) {
int objs = djn_decode->DecodeNext(variant);
assert(objs > 0);
}
delete variant;
delete djn_decode
We are actively developing Djinn and are always interested in improving its quality. If you run into an issue, please report the problem on our Issue tracker. Be sure to add enough detail to your report that we can reproduce the problem and address it.
The extended word-aligned hybrid (EWAH) data model was first described by @lemire:
The positional Burrows-Wheeler transform (PBWT) was described by @richarddurbin:
The idea of using statistical models in compression has been around for decades but the ideas and original code used in this project was first described by @jkbonfield:
This is a collaborative effort between Marcus D. R. Klarqvist (@klarqvist) and James Bonfield (@jkbonfield).
Djinn is licensed under MIT