mklarqvist / djinn

C++ library for analysing and storing large-scale cohorts of sequence variant data
Apache License 2.0
17 stars 1 forks source link

Djinn

Djinn is an open-source single-header C++ library for efficiently storing and analyzing large collections of genotypes and/or haplotypes.

Branch status:

Build Status Release License

Performance evaluation

For reference, our proposed compression algorithms were tested and compared against each other and against the incumbent standard Vcf and Bcf formats using a disk-based benchmark. The host architecture used is a 10 nm Cannon Lake Core i3-8121U with gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5).

Method File size (MB) Comp. Time (s) Ratio (VCF) Ratio (BCF) Decompress* (MB/s) Inflate** (MB/s) Output VCF (s)
djinn PBWT+EWAH+ZSTD-19 21.36 119.9 814.2 10.7 24119 1140 31.0
djinn PBWT+EWAH+LZ4-HC-9 26.48 93.9 656.7 8.6 25917 1143 28.2
djinn PBWT+CTX 16.19 83.0 1074.1 14.2 3757 907 34.1
djinn EWAH+LZ4-HC-9 57.05 84.9 304.8 4.0 15241 2827 19.0
djinn EWAH+ZSTD-19 41.84 207.4 415.6 5.4 12853 2667 19.9
djinn CTX 86.7 73.6 200.5 2.6 1118 858 33.2
BCF 229.9 NA 80.5 1 NA NA 148.3
VCF.gz 281.34 NA 65.8 0.82 NA NA 457.6

*Decompress: decompressing into the succinct EWAH data structure. **Inflate: decompress and inflate into byte arrays of length N (as in uBcf).

Djinn can offer strong compression with slower decompression speeds using statistical models (CTX). Reversely, the EWAH-based algorithms provide stronger decompression and query speeds with lower compression rates. The optimal algorithm dependends on its application context: whether query speeds or compression matters the most.

Compression Ratio

Example analysis of HRC-chr20 dataset. a) Output Djinn blocks results in resetting the PBWT permutation order to circumvent storing the permutation array. Each cycle (spike) corresponds to an indenpendent data block with 8192 variants for either with PBWT (green) or without (blue). There is a considerable difference in compressibility and query speed between the two approaches: PBWT-based approaches consume less disk space require more CPU time to query and PBWT-free models require more disk space but is vastly faster to query. b) Zoom in for the first five Djinn blocks in a). c) PBWT-free models can achieve hundreds of gigabytes per second inflate speeds (in Vcf-space) when operating in EWAH-space.

Building

Building Djinn requires no additional packages. For full support, building requires zstd or lz4. If you intend on using the optional support class vcf_reader.h for consuming htslib-based files then htslib is required. Build with autotools:

./autogen.sh   # Generate the header template
./configure    # Generate Makefile
make           # Build libraries and place in .libs directory
make install   # Install headers and libraries

Developers' Guide

Djinn is designed as a programming library providing simple C++ APIs to build/load an archieve and to query it. The examples/ directory contains a variety of examples that demonstrates typical uses of the C++ APIs. The library requires only a single header: djinn.h. This file contains additional detailed API documentation. Djinn aims to keep the APIs for djinn_* classes in djinn.h stable.

Implemented algorithms:

Quick guide

A simple quick start guide to encoding:

#include <djinn/djinn.h>

// Initiate new model of interest.
djinn::djinn_model* djn_ctx = new djinn::djinn_ctx_model();

// Encode Bcf-encoded data
int ret = djn_ctx->EncodeBcf(data_in, data_len, ploidy, n_allele);

// Finish encoding and serialize to stdcout
djn_ctx->FinishEncoding();
int serial_size = djn_ctx->Serialize(std::cout);
delete djn_ctx

Changing the desired algorithm is easy: replace djinn::djinn_ctx_model() with for example, djinn::djinn_ewah_model(djinn::CompressionStrategy::LZ4, 9) or djinn::djinn_ewah_model(djinn::CompressionStrategy::ZSTD, 21).

Decoding data is equally simple

#include <djinn/djinn.h>

// Initiate new model of interest.
djinn::djinn_model* djn_decode = new djinn::djinn_ctx_model();

// Deserialize model from cin
int decode_ctx_ret = djn_decode->Deserialize(std::cin);
// Primary return structure in Djinn
djinn::djinn_variant_t* variant = nullptr;

// Start decoding and retrieve variants iteratively
djn_ctx->StartDecoding();
for (int i = 0; i < djn_decode->n_variants; ++i) {
    int objs = djn_decode->DecodeNext(variant);
    assert(objs > 0);
}
delete variant;
delete djn_decode

Getting help

We are actively developing Djinn and are always interested in improving its quality. If you run into an issue, please report the problem on our Issue tracker. Be sure to add enough detail to your report that we can reproduce the problem and address it.

Algorithmic overview

The extended word-aligned hybrid (EWAH) data model was first described by @lemire:

The positional Burrows-Wheeler transform (PBWT) was described by @richarddurbin:

The idea of using statistical models in compression has been around for decades but the ideas and original code used in this project was first described by @jkbonfield:

Limitations

Note

This is a collaborative effort between Marcus D. R. Klarqvist (@klarqvist) and James Bonfield (@jkbonfield).

License

Djinn is licensed under MIT