PgRC: Pseudogenome-based Read Compressor

Pseudogenome-based Read Compressor (PgRC) is an in-memory algorithm for compressing the DNA stream of FASTQ datasets, based on the idea of building an approximation of the shortest common superstring over high-quality reads.

The implementation supports constant-length reads limited to 255 bases.
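
The read-length constraint can be checked before compression. The following is a minimal sketch, not part of PgRC: `check_fastq_reads` is a hypothetical helper, and it assumes a FASTQ file with exactly four lines per record (no wrapped sequence lines) and Unix line endings.

```shell
# Hypothetical helper (not part of PgRC): succeed only if every read in a
# 4-line-per-record FASTQ file has the same length and at most 255 bases.
check_fastq_reads() {
    awk '
        NR % 4 == 2 {                      # sequence line of each record
            len = length($0)
            if (len > 255) {
                printf "line %d: read of %d bases exceeds 255\n", NR, len
                bad = 1; exit
            }
            if (!seen) {
                first = len; seen = 1      # remember the first read length
            } else if (len != first) {
                printf "line %d: read of %d bases, expected %d\n", NR, len, first
                bad = 1; exit
            }
        }
        END { exit bad + 0 }               # exit status 1 if a check failed
    ' "$1"
}
```

For example, `check_fastq_reads in.fastq && echo "reads look compatible"` prints the message only when both conditions hold.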

Installation on Linux - manual build

The following steps create a PgRC executable. Building PgRC on Linux requires CMake version 3.5 or later (check with cmake --version):

git clone https://github.com/kowallus/PgRC.git
cd PgRC
mkdir build
cd build
cmake ..
make PgRC
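
The CMake version requirement can also be checked from a script. This is a sketch only: `version_at_least` and `check_cmake` are hypothetical helpers, and the comparison relies on `sort -V` from GNU coreutils, which is assumed to be available.

```shell
# Hypothetical helper (not part of PgRC): succeed when version $1 >= version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Report whether the installed cmake satisfies the 3.5 minimum.
check_cmake() {
    if ! command -v cmake >/dev/null 2>&1; then
        echo "cmake not found"
        return 1
    fi
    # The first line of `cmake --version` has the form "cmake version X.Y.Z".
    found=$(cmake --version | head -n 1 | awk '{print $3}')
    if version_at_least "$found" 3.5; then
        echo "cmake $found is recent enough"
    else
        echo "cmake $found is older than 3.5"
        return 1
    fi
}
```

Version sort is used rather than a numeric comparison because, for example, 3.22 is newer than 3.5 although it is smaller when read as a decimal number.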

Basic usage

PgRC [-i <seqSrcFile> [<pairSrcFile>]] [-t <noOfThreads>] [-o] [-d] <archiveName>

   -o preserve original read order information
   -t number of threads used
   -d decompression mode

compression of DNA stream in order non-preserving regime (SE mode):

./PgRC -i in.fastq comp.pgrc

compression of DNA stream in order preserving regime (SE_ORD mode):

./PgRC -o -i in.fastq comp.pgrc

compression of paired-end DNA stream in order non-preserving regime (PE mode):

./PgRC -i in1.fastq in2.fastq comp.pgrc

compression of paired-end DNA stream in order preserving regime (PE_ORD mode):

./PgRC -o -i in1.fastq in2.fastq comp.pgrc

decompression of DNA stream to the current folder:

./PgRC -d comp.pgrc
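
When several single-end files need compressing, the SE-mode invocation shown above can be wrapped in a loop. The sketch below is illustrative only: `compress_dir` and the `PGRC_BIN` variable are hypothetical names, and naming each archive after its input file is an assumption, not a PgRC convention.

```shell
# Path to the PgRC executable; ./PgRC matches the examples above.
PGRC_BIN=${PGRC_BIN:-./PgRC}

# Hypothetical helper: compress every *.fastq file in directory $1 in SE mode,
# naming each archive <input name>.pgrc. Stops at the first failure.
compress_dir() {
    for fq in "$1"/*.fastq; do
        [ -e "$fq" ] || continue    # no matches: the glob is left unexpanded
        "$PGRC_BIN" -i "$fq" "${fq%.fastq}.pgrc" || return 1
    done
}
```

Calling `compress_dir reads` would run `./PgRC -i reads/x.fastq reads/x.pgrc` for each matching file in the `reads` directory.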

Publications

Tomasz M. Kowalski, Szymon Grabowski: PgRC: pseudogenome-based read compressor. Bioinformatics, Volume 36, Issue 7, pp. 2082–2089 (2020).

supplementary data

bioRxiv

Related projects

PgSA - Pseudogenome Suffix Array
