refresh-bio / colord

A versatile compressor of third generation sequencing reads.
GNU General Public License v3.0

CoLoRd - Compressing long reads

Quick start

git clone --recurse-submodules https://github.com/refresh-bio/colord
cd colord && make
cd bin

INPUT=./../test

# default compression presets (lossy quality, memory priority)
./colord compress-ont ${INPUT}/M.bovis.fastq ont.default        # Oxford Nanopore
./colord compress-pbhifi ${INPUT}/D.melanogaster.fastq hifi.default # PacBio HiFi 
./colord compress-pbraw ${INPUT}/A.thaliana.fastq clr.default       # PacBio CLR/subreads

# print ONT archive information and decompress
./colord info ont.default
./colord decompress ont.default ont.fastq

# compress HiFi reads preserving original quality levels
./colord compress-pbhifi -q org ${INPUT}/D.melanogaster.fastq hifi.lossless

# compress CLR reads with ratio priority using 48 threads
./colord compress-pbraw -p ratio -t 48 ${INPUT}/A.thaliana.fastq clr.ratio

# compress ONT reads w.r.t. reference genome (embed the reference in the archive)
./colord compress-ont -G ${INPUT}/M.bovis-reference.fna -s ${INPUT}/M.bovis.fastq ont.refbased

# decompress the reference-based archive
./colord decompress ont.refbased ont.refbased.fastq
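
Because the default presets compress quality scores lossily, a decompressed FASTQ file will generally not be byte-identical to the input, so a plain `cmp` is not a useful check. The sketch below compares only the base sequences instead. It assumes four-line FASTQ records (no wrapped sequence lines) and that CoLoRd preserves read order and bases, which is worth verifying on your own data. The helper name `same_bases` is hypothetical and not part of CoLoRd.

```shell
# Hypothetical helper: succeed if two FASTQ files contain the same base
# sequences in the same order. Assumes 4-line records (no line wrapping).
same_bases() {
    a=$(awk 'NR % 4 == 2' "$1" | cksum)   # checksum of every sequence line
    b=$(awk 'NR % 4 == 2' "$2" | cksum)
    [ "$a" = "$b" ]
}

# Usage, once both files exist (paths taken from the quick start above):
#   same_bases ../test/M.bovis.fastq ont.fastq && echo "bases identical"
```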

Installation and configuration

CoLoRd comes with a set of precompiled binaries for Windows, Linux, and OS X. They can be found under the Releases tab. The software is also available on Bioconda:

conda install -c bioconda colord

For detailed instructions on how to set up Bioconda, please refer to the Bioconda manual. CoLoRd can also be built from the sources distributed as:

To install G++ under macOS, one can use the Homebrew package manager:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install gcc@10

Before running CoLoRd on macOS, the current limit of file descriptors should be increased:

ulimit -n 2048
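
Note that `ulimit` only affects the current shell session and the processes started from it. A minimal sketch that raises the soft limit only when it is below the 2048 mentioned above (the variable names are illustrative):

```shell
# Raise the soft limit on open file descriptors to at least 2048 for this
# shell session; leave it alone if it is already high enough.
required=2048
current=$(ulimit -n)
if [ "$current" != "unlimited" ] && [ "$current" -lt "$required" ]; then
    # This can fail if the hard limit (ulimit -Hn) is below $required.
    ulimit -n "$required" || echo "could not raise the limit above $current" >&2
fi
echo "open-file limit: $(ulimit -n)"
```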

Usage

Compression

colord <mode> [options] <input> <archive>

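Modes:

compress-ont - compress Oxford Nanopore reads
compress-pbhifi - compress PacBio HiFi reads
compress-pbraw - compress PacBio CLR/subreads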

Positionals:

Options:

Advanced options (default values may depend on the mode - please run colord --help <mode> to get the details):

Hints

While the number of CoLoRd parameters is large, in most cases the default values will work just fine. In terms of compression, there is always a trade-off between compression ratio and resource requirements (mainly memory and compute time). If the default behavior of CoLoRd is insufficient, the first thing to try is changing the compression priority mode (-p parameter). The compression priority modes aggregate multiple other parameters influencing compression ratio. The priority modes, in order of increasing compression efficiency and resource requirements, are memory, balanced, and ratio.

The memory priority mode is the default.

Quality scores have a high impact on the compression. They are hard to compress due to their nature and, at the same time (as presented in the paper), their resolution can be safely reduced without affecting downstream analyses. For this reason, in each priority mode, the quality scores are compressed lossily by default. If it is required to keep the original quality scores, one should use -q org. Note that there exist several other quality compression modes (see the paper).

Here are compression results for a large set of human reads NA12878 with a total size of 268,305,314,354 bytes.

Metric                                  Lossy            Lossless
Compressed size in memory mode [B]      42,120,596,486   105,807,350,384
Compressed size in balanced mode [B]    39,833,878,505   103,367,993,362
Compressed size in ratio mode [B]       38,832,714,102   101,305,368,675
Time in memory mode [h:mm:ss]           1:12:42          1:26:02
Time in balanced mode [h:mm:ss]         1:33:18          2:11:21
Time in ratio mode [h:mm:ss]            3:18:46          4:57:09
Memory in memory mode [KB]              13,715,168       14,341,128
Memory in balanced mode [KB]            26,728,108       27,293,824
Memory in ratio mode [KB]               97,922,208       99,133,548
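
To put these sizes in perspective, the compression ratio (input size divided by archive size) can be worked out directly from the table. The snippet below is only arithmetic on the numbers above; the function and variable names are illustrative. It gives roughly 6.37x (memory) and 6.91x (ratio) for lossy quality, and 2.54x and 2.65x for lossless.

```shell
# Compression ratio = input size / archive size (sizes in bytes, from the table).
original=268305314354
ratio() { awk -v o="$original" -v c="$1" 'BEGIN { printf "%.2f", o / c }'; }

lossy_memory=$(ratio 42120596486)
lossy_ratio=$(ratio 38832714102)
lossless_memory=$(ratio 105807350384)
lossless_ratio=$(ratio 101305368675)

echo "lossy:    memory mode ${lossy_memory}x, ratio mode ${lossy_ratio}x"
echo "lossless: memory mode ${lossless_memory}x, ratio mode ${lossless_ratio}x"
```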

To see how far CoLoRd can squeeze the input data regardless of resource requirements, use the ratio mode. If more control over execution is needed, the remaining parameters can be configured. The simplest way to find out in which direction to move a parameter, without having to understand its meaning, is to display the defaults for a given compression priority mode with the --help switch. For example, suppose you want to find out whether to increase or decrease the -f parameter to improve the compression ratio when compressing ONT data. You can run CoLoRd twice with the following parameters:

./colord compress-ont --help -p balanced
./colord compress-ont --help -p ratio

You will notice that the default for -f is higher in balanced mode than in ratio mode, which means that lowering it will increase the compression ratio. The same approach can be applied to other parameters (-L, -H, -c, -r, --min-to-alt, etc.).
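
Comparing the two help screens by eye is error-prone, so one option is to let `diff` show only the lines that change between priority modes. The following is a sketch under assumptions: the helper name `diff_defaults` is hypothetical, and it assumes the help text may be written to either stdout or stderr, so both streams are captured.

```shell
# Hypothetical helper: show the lines of the --help output that differ between
# two priority modes. Arguments: colord binary, compression mode, two priorities.
diff_defaults() {
    bin=$1; mode=$2; first=$3; second=$4
    out_a=$(mktemp) || return 1
    out_b=$(mktemp) || return 1
    # Capture both streams; the exit status of --help is deliberately ignored.
    "$bin" "$mode" --help -p "$first"  > "$out_a" 2>&1 || true
    "$bin" "$mode" --help -p "$second" > "$out_b" 2>&1 || true
    # diff exits with status 1 when the files differ, which is not an error here.
    diff "$out_a" "$out_b" && status=0 || status=$?
    rm -f "$out_a" "$out_b"
    [ "$status" -le 1 ]
}

# Usage (from the bin directory):
#   diff_defaults ./colord compress-ont balanced ratio
```

Lines prefixed with `<` come from the first priority mode and lines prefixed with `>` from the second.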

In the ratio priority mode, all the input reads may serve as a reference to encode other reads. This will increase RAM usage, especially for large datasets. In the remaining modes, only part of the reads may serve as a reference. If needed, -g and -x may be used.

The values of the -k and -a parameters are auto-adjusted based on the size of the data to be compressed. As a general rule, the larger the input, the higher these values should be.

Decompression

colord decompress [options] <archive> <output>

Positionals:

Options:

Archive information

colord info <archive>

API

CoLoRd comes with a C++ API that gives straightforward access to an existing archive. Below is an example of using the API in code.

#include "colord_api.h"
#include <iostream>

int main(int argc, char** argv) {
    try {
        colord::DecompressionStream stream("archive.colord");   // load a CoLoRd archive
        auto info = stream.GetInfo();                           // get archive information
        std::cerr << "Archive info:\n\n";
        info.ToOstream(std::cerr);                              // print it to stderr

        // iterate over records in the archive
        while (auto x = stream.NextRecord()) {
            if (info.isFastq) {
                std::cout << "@" << x.ReadHeader() << "\n";
                std::cout << x.Read() << "\n";
                std::cout << "+" << x.QualHeader() << "\n";
                std::cout << x.Qual() << "\n";
            } else {
                std::cout << ">" << x.ReadHeader() << "\n";
                std::cout << x.Read() << "\n";
            }
        }
    }
    catch (const std::exception& ex) {
        std::cerr << "Error: " << ex.what() << "\n";
        return -1;
    }   
    return 0;
}

Compiling your own code with the CoLoRd API

To use the API, include the colord_api.h header file and link against libcolord_api.a. Because libcolord_api.a uses std::thread and zlib, the -lpthread and -lz flags are needed for linking. For example, to compile and link the code above, one could use the following command:

g++ -O3 $SRC_FILE -I$INCLUDE_DIR $LIB_DIR/libcolord_api.a -lz -lpthread -o example -no-pie

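where $SRC_FILE is the source file to compile, $INCLUDE_DIR is the directory containing colord_api.h, and $LIB_DIR is the directory containing libcolord_api.a.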
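
As a rough sketch, the command can be wrapped in a small script that checks for the header and the library before compiling. The default paths below (`example.cpp`, `./include`, `./lib`) are placeholders chosen for illustration, not locations documented by CoLoRd, so adjust them to your setup. The paths are also assumed to contain no spaces.

```shell
# Placeholder locations (assumptions): point these at wherever colord_api.h
# and libcolord_api.a actually live on your system.
SRC_FILE=${SRC_FILE:-example.cpp}
INCLUDE_DIR=${INCLUDE_DIR:-./include}
LIB_DIR=${LIB_DIR:-./lib}

# Same command as above, with the variables filled in.
build_cmd="g++ -O3 $SRC_FILE -I$INCLUDE_DIR $LIB_DIR/libcolord_api.a -lz -lpthread -o example -no-pie"

if [ -f "$INCLUDE_DIR/colord_api.h" ] && [ -f "$LIB_DIR/libcolord_api.a" ]; then
    $build_cmd   # unquoted on purpose so the shell splits it into arguments
else
    echo "colord API files not found; would run: $build_cmd" >&2
fi
```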

Citing

Kokot, M., Gudyś, A., Li, H. and Deorowicz, S. (2022) CoLoRd: Compressing long reads. Nature Methods, https://doi.org/10.1038/s41592-022-01432-3