refresh-bio / Whisper

GNU General Public License v3.0

Whisper2


Quick start

# download and unpack the E. coli str. K-12 substr. MG1655 reference genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/904/425/475/GCF_904425475.1_MG1655/GCF_904425475.1_MG1655_genomic.fna.gz
gzip -k -d GCF_904425475.1_MG1655_genomic.fna.gz 

# download NovaSeq 6000 reads from E. coli MG1655 IR-10-94 population sequencing
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR100/030/SRR10051130/SRR10051130_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR100/030/SRR10051130/SRR10051130_2.fastq.gz

# clone and build Whisper
git clone https://github.com/refresh-bio/Whisper
make -C Whisper/src

# build an index named ecoli for the reference genome
mkdir index
mkdir temp
./Whisper/src/whisper-index ecoli GCF_904425475.1_MG1655_genomic.fna ./index ./temp

# map reads to the reference genome and store the results in mappings.sam
./Whisper/src/whisper -rp -out mappings -temp ./temp/ ./index/ecoli  SRR10051130_1.fastq.gz SRR10051130_2.fastq.gz

Note that Whisper is optimized for large, high-coverage sequencing samples (e.g. human), so this small E. coli data set only demonstrates the workflow.

Installation and configuration

Whisper comes with a set of precompiled binaries for Windows and Linux. They can be found under the Releases tab.

The software can also be built from the sources distributed as:

NOTE

Linux systems limit the number of files that a single process can have open. Make sure this limit is sufficient for Whisper, which requires:

num_files = num_bins (384 by default) + num_threads + 2 * total_size_of_FASTQ_in_GB

For example, if the sample reads occupy 100 GB in uncompressed FASTQ format and processing uses 16 CPU threads, Whisper opens 384 + 16 + 2 * 100 = 600 files. To change the limit, use the Linux ulimit command:

ulimit -n 600
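The calculation above can be scripted so the limit is derived from the actual run parameters instead of being typed by hand. The following is a minimal sketch, not part of Whisper: the function name `whisper_num_files` is hypothetical, the FASTQ size is assumed to be given in whole gigabytes, and the number of bins defaults to the 384 mentioned above.

```shell
# Estimate the number of files Whisper opens, following the formula above:
#   num_files = num_bins + num_threads + 2 * total_size_of_FASTQ_in_GB
# Usage: whisper_num_files <num_threads> <fastq_size_gb> [num_bins]
# (hypothetical helper; num_bins defaults to 384, sizes are whole GB)
whisper_num_files() {
    echo $(( ${3:-384} + $1 + 2 * $2 ))
}

# Example from the text: 16 threads, 100 GB of uncompressed FASTQ.
needed=$(whisper_num_files 16 100)
current=$(ulimit -n)

# Raise the soft limit for this shell only when it is too low.
if [ "$current" != "unlimited" ] && [ "$current" -lt "$needed" ]; then
    ulimit -n "$needed" || echo "could not raise the open-file limit to $needed" >&2
fi
echo "Whisper needs about $needed open files"
```

The limit set by ulimit applies only to the current shell and the processes started from it, so Whisper has to be launched from the same session.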

Usage

The first step of the analysis, performed only once for a given reference genome, is building an index. The index can then be used to map reads from different samples to that reference.

Indexing reference genome

Indexing can be run in two variants, depending on how the reference is stored (a single FASTA file versus multiple FASTA files):

whisper-index <index_name> <ref_seq_file_name> <dest_dir> <temp_dir>

whisper-index <index_name> @<ref_seq_files_name> <dest_dir> <temp_dir>

Parameters:

Examples:

Generates an index named hg38-chr20 for the chr20.fa reference sequence and places it in the index-dir directory.

Generates an index named hg38 for all FASTA files listed in the hg38.list file and places it in the index-dir directory.
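The two cases described above can be sketched by filling in the whisper-index synopsis. This is an illustration only: the wrapper `build_index`, the `WHISPER_INDEX` variable, and the `temp-dir` directory name are assumptions, and creating the directories up front simply mirrors what the Quick start does with mkdir.

```shell
# Path to the indexer; override when whisper-index is not on PATH
# (hypothetical variable, not a Whisper setting).
WHISPER_INDEX="${WHISPER_INDEX:-whisper-index}"

# build_index <index_name> <ref_file_or_@list> <dest_dir> <temp_dir>
# Creates both directories if they are missing, then calls whisper-index
# with the argument order given in the synopsis.
build_index() {
    mkdir -p "$3" "$4" || return 1
    "$WHISPER_INDEX" "$1" "$2" "$3" "$4"
}

# Single FASTA file (temp-dir is a placeholder name):
#   build_index hg38-chr20 chr20.fa index-dir temp-dir
# Several FASTA files listed in hg38.list (note the @ prefix):
#   build_index hg38 @hg38.list index-dir temp-dir
```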

Mapping reads

whisper [options] <index_name> @<files>

whisper [options] <index_name> file_se

whisper [options] <index_name> file_pe_1 file_pe_2

Parameters:

Options:

Examples:

Maps paired-end reads from the reads_1.fq and reads_2.fq FASTQ files using the hg38 index. Computations are distributed over 12 threads, and the results are stored in the result.sam file.

Maps single-end reads from the FASTQ files listed in the reads_se.list file using the hg38 index. Example contents of reads_se.list:

readsA
readsB
readsC
...

Maps paired-end reads from the FASTQ files listed in the reads_pe.list file using the hg38 index. Each line holds the two mate files of one pair, separated by a semicolon. Example contents of reads_pe.list:

readsA_1;readsA_2
readsB_1;readsB_2
readsC_1;readsC_2
...

Citing

Deorowicz, S., Debudaj-Grabysz, A., Gudyś, A., Grabowski, S. (2018) Whisper: Read sorting allows robust mapping of sequencing data, Bioinformatics, 35(12):2043–2050

Deorowicz, S., Gudyś, A. (2021) Whisper 2: Indel-sensitive short read mapping, SoftwareX, 14:100692