mazzalab / fastqwiper

An ensemble method to recover corrupted FASTQ files, drop or fix pesky lines, remove unpaired reads, and settle reads interleaving.
GNU General Public License v3.0

FastqWiper


FastqWiper recovers corrupted fastq.gz files, drops or fixes malformed lines, removes unpaired reads, and restores read interleaving in FASTQ files.

USAGE

Installation

Case 1

In this case you install FastqWiper directly and run it without the workflows. This works on all operating systems:

Use Conda

conda create -n fastqwiper python=3.11
conda activate fastqwiper
conda install -c bfxcss -c conda-forge fastqwiper

wipertools --help

Hint: for a smoother experience, use mamba.

Use PyPI

pip install fastqwiper


Usage

usage: wipertools [-h] {fastqwiper,splitfastq,summarygather} ...

positional arguments:
    fastqwiper          FastqWiper program
    splitfastq          FASTQ splitter program
    summarygather       Gatherer of the FastqWiper summaries

options:
  -h, --help            show this help message and exit

usage: wipertools fastqwiper [-h] -i FASTQ_IN -o FASTQ_OUT [-l [LOG_OUT]] [-f [LOG_FREQUENCY]] [-a [ALPHABET]]

options:
  -i, --fastq_in TEXT          The input FASTQ file to be cleaned  [required]
  -o, --fastq_out TEXT         The wiped FASTQ file                [required]
  -l, --log_out TEXT           File name of the final quality report summary. Printed on screen if not specified
  -f, --log_frequency INTEGER  Number of reads between status messages. Default: 500000
  -a, --alphabet               Allowed characters in the SEQ line. Default: ACGTN
  -h, --help                   Show this message and exit.


FastqWiper accepts only readable *.fastq or *.fastq.gz files as input.
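To make the idea of "wiping" more concrete, here is a minimal Python sketch of one possible record check. It keeps a record only if it has an @ header, a SEQ line drawn from the allowed alphabet, a + separator, and a quality line of the same length as the sequence. This is an illustration under those assumptions, not FastqWiper's actual implementation, and the function name valid_records is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str, str]


def valid_records(lines: Iterable[str], alphabet: str = "ACGTN") -> Iterator[Record]:
    """Yield well-formed 4-line FASTQ records and skip everything else.

    Illustrative only: the checks below are assumptions about what a
    well-formed record looks like, not FastqWiper's own rules.
    """
    allowed = set(alphabet)
    buf = [line.rstrip("\n") for line in lines]
    i = 0
    while i + 4 <= len(buf):
        header, seq, plus, qual = buf[i:i + 4]
        is_valid = (
            header.startswith("@")
            and len(seq) > 0
            and set(seq) <= allowed          # SEQ uses only allowed characters
            and plus.startswith("+")
            and len(qual) == len(seq)        # one quality value per base
        )
        if is_valid:
            yield (header, seq, plus, qual)
            i += 4                           # jump to the next record
        else:
            i += 1                           # resynchronize one line at a time


if __name__ == "__main__":
    demo = ["@r1", "ACGT", "+", "IIII", "garbage", "@r2", "NNAC", "+", "IIII"]
    for record in valid_records(demo):
        print("\n".join(record))
```

Advancing one line at a time after a failed check lets the scan recover from a stray or truncated line instead of discarding every record that follows it.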

Case 2 & Case 3

There are QUICK and SLOW methods to configure FastqWiper's workflows.

One quick way (Docker)

  1. Pull the Docker image from DockerHub:

docker pull mazzalab/fastqwiper

  2. Once the image is downloaded, type:

CMD: docker run --rm -ti --name fastqwiper -v "YOUR_LOCAL_PATH_TO_DATA_FOLDER:/fastqwiper/data" mazzalab/fastqwiper paired 8 sample 33 ACGTN 500000
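The same invocation can be assembled programmatically, for example when wiping many samples in a loop. The following is a minimal Python sketch. The helper name docker_command is hypothetical, and the trailing workflow arguments are passed through verbatim from the command above without interpretation. The sketch resolves the data folder to an absolute path, because Docker treats a relative -v source as a named volume rather than a bind mount.

```python
import shlex
from pathlib import Path
from typing import List, Sequence


def docker_command(data_dir: str, workflow_args: Sequence[str]) -> List[str]:
    """Build the `docker run` argument list for the FastqWiper image.

    `docker_command` is a hypothetical helper; `workflow_args` is passed
    through unchanged after the image name.
    """
    host = Path(data_dir).resolve()  # bind mounts need an absolute host path
    return [
        "docker", "run", "--rm", "-ti",
        "--name", "fastqwiper",
        "-v", f"{host}:/fastqwiper/data",
        "mazzalab/fastqwiper",
        *workflow_args,
    ]


if __name__ == "__main__":
    cmd = docker_command("data", ["paired", "8", "sample", "33", "ACGTN", "500000"])
    # Print a shell-safe version of the command instead of running it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when the data folder path contains spaces.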

Another quick way (Singularity)

  1. Pull the Singularity image from the Cloud Library:

singularity pull library://mazzalab/fastqwiper/fastqwiper.sif

  2. Once the image is downloaded (e.g., fastqwiper.sif_2024.2.104.sif), type:

CMD: singularity run --bind YOUR_LOCAL_PATH_TO_DATA_FOLDER:/fastqwiper/data --writable-tmpfs fastqwiper.sif_2024.2.104.sif paired 8 sample 33 ACGTN 500000

If you want to bind the .singularity cache folder and the logs folder, you can omit --writable-tmpfs, create the folders .singularity and logs (mkdir .singularity logs) on the host system, and use this command instead:

CMD: singularity run --bind YOUR_LOCAL_PATH_TO_DATA_FOLDER/:/fastqwiper/data --bind YOUR_LOCAL_PATH_TO_.SNAKEMAKE_FOLDER/:/fastqwiper/.snakemake --bind YOUR_LOCAL_PATH_TO_LOGS_FOLDER/:/fastqwiper/logs fastqwiper.sif_2024.2.104.sif paired 8 sample 33 ACGTN 500000

For both Docker and Singularity:

The slow way (Linux & Mac OS)

To enable the use of preconfigured pipelines, you need to install Snakemake. The recommended way to install Snakemake is via Conda, because it enables Snakemake to handle software dependencies of your workflow. However, the default conda solver is slow and often hangs. Therefore, we recommend installing Mamba as a drop-in replacement via

conda install -c conda-forge mamba

if you already have Anaconda/Miniconda installed, or installing Mambaforge directly as described in its installation instructions.

Then, create and activate a clean environment as above:

mamba create -n fastqwiper python=3.11
mamba activate fastqwiper

Finally, install the Snakemake dependency:

mamba install -c bioconda snakemake

Usage

Clone the FastqWiper repository in a folder of your choice and enter it:

git clone https://github.com/mazzalab/fastqwiper.git
cd fastqwiper

The repository contains, in particular, a data folder for the FASTQ files to be processed, a pipeline folder with the released pipelines, and a fastqwiper folder with the source files of FastqWiper.
Input files to be processed must be copied into the data folder.

Currently, to run the FastqWiper pipelines, the following packages need to be installed manually:

required packages:

gzrt (Linux build from source instructions, Ubuntu install instructions, Mac OS install instructions)

BBTools (install instructions)

If installed from source, the gzrt scripts must be on your PATH. bbmap must be installed in the root folder of FastqWiper, as shown in the image below.

FastqWiper folder hierarchy

Commands:

Copy the FASTQ files you want to fix into the data folder.

N.B.: In all commands, you pass the name of the sample to be analyzed to the workflow through the config argument sample_name. Your FASTQ file names must end with _R1.fastq.gz and _R2.fastq.gz for paired files, and with .fastq.gz for single files; the value assigned to sample_name is everything before that suffix. E.g., if your files are my_sample_R1.fastq.gz and my_sample_R2.fastq.gz, then use --config sample_name=my_sample.
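The naming rule can be checked before launching a workflow. Below is a minimal Python sketch that derives the sample_name value from a file name. The helper name sample_name_for is hypothetical, and returning None for a file that does not follow the convention is a design choice of this sketch.

```python
from pathlib import Path
from typing import Optional


def sample_name_for(filename: str, paired: bool) -> Optional[str]:
    """Return the text before the expected suffix, or None if it is missing.

    Paired files must end with _R1.fastq.gz or _R2.fastq.gz;
    single files must end with .fastq.gz.
    """
    name = Path(filename).name  # ignore any leading folders such as data/
    suffixes = ("_R1.fastq.gz", "_R2.fastq.gz") if paired else (".fastq.gz",)
    for suffix in suffixes:
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)]
    return None


if __name__ == "__main__":
    print(sample_name_for("data/my_sample_R1.fastq.gz", paired=True))
    print(sample_name_for("data/my_sample.fastq.gz", paired=False))
```

For my_sample_R1.fastq.gz in paired mode, the result is my_sample, which is the value to pass as --config sample_name=my_sample.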

Paired-end files

Fixed files are written to the data folder with the suffix _fixed_wiped_paired_interleaving. As a reminder, the fix_wipe_pairs_reads_sequential.smk and fix_wipe_pairs_reads_parallel.smk pipelines perform the following actions:

Single-end files

fix_wipe_single_reads_parallel.smk and fix_wipe_single_reads_sequential.smk do not execute trimmomatic or BBmap's repair.sh.

Author

Tommaso Mazza

Laboratory of Bioinformatics
Fondazione IRCCS Casa Sollievo della Sofferenza
Viale Regina Margherita 261 - 00198 Roma IT
Tel: +39 06 44160526 - Fax: +39 06 44160548
E-mail: t.mazza@operapadrepio.it
Web page: http://www.css-mendel.it
Web page: http://bioinformatics.css-mendel.it