
DeltaMP

A flexible, reproducible and resource efficient metabarcoding amplicon pipeline for HPC

DeltaMP is a command-line tool for high-performance computers that takes advantage of queueing systems to parallelize and standardize the bioinformatic processing of metabarcoding raw read libraries.

DeltaMP was initially developed to process 16S, 18S, 28S, ITS, COI and rbcL raw read libraries with the most up-to-date bioinformatic workflows, but it can also handle any other barcoding target (e.g. 23S).

DeltaMP aims to be accessible to non-bioinformaticians, with fully tunable workflows based on a TAB-separated configuration file.

DeltaMP integrates a checkpointing feature that enables easy and efficient comparison of different workflows applied to the same set of read libraries.

Last but not least, DeltaMP produces version-controlled, reproducible and fully documented OTU tables in TAB-separated and BIOM formats, readily usable for downstream taxonomic and OTU diversity analyses.

Table of Contents

Installation

Source code of DeltaMP version 0.6 is available at https://github.com/lentendu/DeltaMP/releases/tag/v0.6

The repository can also be cloned, for example with:

git clone https://github.com/lentendu/DeltaMP.git

Then install following the installation instructions.
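A minimal sketch of the typical clone-and-build sequence, assuming the default make-based installation referred to above (the exact targets may differ; see the installation instructions):

cd DeltaMP
# build the pipeline scripts into bin/ and the example configuration files into test/
make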

Dependencies

DeltaMP is intended to be used on an HPC system with a job scheduler (i.e. a batch-queuing system).

Supported job schedulers:

SLURM

Grid Engine

Compulsory software:

Optional software:

All these dependencies need to be available through the $PATH environment variable or be loaded by the DeltaMP module file.
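On a cluster using environment modules, this might look as follows (the module name and installation path are hypothetical and site-specific):

# load DeltaMP and its dependencies via the DeltaMP module file
module load DeltaMP
# or make the deltamp executable reachable through $PATH manually
export PATH=/path/to/DeltaMP/bin:$PATH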

Quick start

Usage instructions

Usage instructions can be displayed via the command line:

deltamp -h
NAME
    DeltaMP version 0.6 - a flexible, reproducible and resource efficient metabarcoding amplicon pipeline for HPC

SYNOPSIS
    Usage: deltamp [-a account_or_project_name] [-cdfhnqtx] [-m max_running_tasks] [-p reference_subproject] [-r step] configuration_file

DESCRIPTION
    -h  display this help and exit

    -a ACCOUNT
        account or project name for the job queuing system

    -c  check dry run (-d option) by preserving the SUBPROJECT and PROJECT directories in the output directory

    -d  dry run: avoids any submission to the queuing system and only outputs submission information

    -f  display default values for optional configuration parameters

    -n  avoid checkpointing (i.e. searching for previous SUBPROJECT) and run the pipeline from the beginning

    -m max_running_tasks
        set the maximum number of concurrently running tasks per array job; default is 400

    -p SUBPROJECT
        only execute jobs that are additional or that have different input variables compared to the reference SUBPROJECT. This will only work if both configuration files have the
        same input libraries and output directory.

    -q  proceed until quality step only

    -r STEP
        restart pipeline computation from STEP. Replace STEP by 'list' to list all available steps of the subproject associated with the provided configuration file.

    -t  proceed until demultiplexing step only

    -x  delete the subproject associated with the provided configuration file

AUTHOR
    Guillaume Lentendu and Tesfaye Wubet

REPORTING BUGS
    Submit suggestions and bug-reports at <https://github.com/lentendu/DeltaMP/issues>, send a pull request on <https://github.com/lentendu/DeltaMP>, or compose an e-mail to Guillaume Lentendu <guillaume.lentendu@unine.ch>.

COPYRIGHT
    Copyright (C) 2018 Guillaume Lentendu and Tesfaye Wubet

    This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of
    the License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
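For example, a configuration can be checked with a dry run before submitting anything to the queue, and the available restart steps can be listed afterwards (the configuration filename is hypothetical):

# dry run: only print submission information, do not submit jobs
deltamp -d configuration_myproject.tsv
# list the steps of the subproject associated with this configuration file
deltamp -r list configuration_myproject.tsv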

Configuration file

To run, DeltaMP always requires a configuration file and, optionally, local copies of raw sequence libraries.

The configuration file is encoded in UTF-8 and its fields are TAB-separated.

If the file is edited under Windows, make sure to use Unix-compliant line endings (\n).
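If in doubt, the line endings can be converted on the command line, for example (assuming dos2unix or sed is available):

# convert Windows line endings (\r\n) to Unix line endings (\n)
dos2unix configuration_file
# alternative using sed
sed -i 's/\r$//' configuration_file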

The following sections describe the parameters to provide in the configuration file.

Only the project name parameter and a sample list (see the SAMPLES section) are compulsory.

A parameter is recognized by its exact definition in the first column of the configuration file.

If an optional parameter is left empty, or if its definition is not found in the first column, its default value is used instead.

The default values of the optional parameters can be displayed with the -f option (see the usage instructions).
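As an illustration, the first rows of a configuration file could look like the following, one parameter definition per row followed by its TAB-separated value (the spellings below are only indicative; check the example files in test/ and the -f output for the exact definitions):

Project name	MyProject
Path to output location	/path/to/output
Path to execution location	/path/to/scratch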

PROJECT section

LIBRARIES section

TARGET section

PAIR-END section

all Illumina-specific parameters

TRIMMING section

PIPELINE section

SAMPLES section

Columns 3 and 4 accept libraries in fastq or sff format with most kinds of compression (.gz, .tar, .tar.gz, .tgz, .bz2, .tar.bz2, .tbz2, .zip).

For ENA libraries, column 2 has to match the ENA "Submitter's sample name" field, and columns 3 and 4 have to match full ENA FTP or FASP (for Aspera Connect download) URLs of a run accession, as listed in the ENA fields "Submitted files (FTP)", "FASTQ files (FTP)", "Submitted files (Aspera)" or "FASTQ files (Aspera)".
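For illustration, a sample row for a paired-end ENA library could look like this (TAB-separated; the sample name and run accession are invented, and column 1 is left elided as its content depends on the library type):

...	sample01	ftp.sra.ebi.ac.uk/vol1/fastq/ERR123/ERR123456/ERR123456_1.fastq.gz	ftp.sra.ebi.ac.uk/vol1/fastq/ERR123/ERR123456/ERR123456_2.fastq.gz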

Example configuration files 'configuration_xxx.tsv' are available in the test/ directory after installation with make.

Pipeline execution

Quick start

To execute the full pipeline, run the following command in a terminal:

deltamp [path/]configuration_file

replacing [path/] with the path to your configuration file and "configuration_file" with its filename. Omit "path/" if you are already in the right directory.

For each DeltaMP execution, a "PROJECT" directory labeled with the project name is created in the "Path to output location" (nothing is done if this directory already exists). A "SUBPROJECT" directory is then created inside the PROJECT directory and labeled as follows: "Project name"_"Sequencing technology"_"Target region"_"unique identifier". The unique identifier (a cksum of the date, time and configuration file) differentiates each instance of pipeline execution for the same project.
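For example, a hypothetical SUBPROJECT directory for an Illumina ITS run could be named:

MyProject_Illumina_ITS_1234567890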

The SUBPROJECT directory is copied into the "Path to execution location" and all necessary jobs are submitted to the queue.

Once all steps' jobs are completed, the outputs are copied back into the "Path to output location"/PROJECT/SUBPROJECT directory.

Directory structure

Best practices

Checkpointing

To avoid running the pipeline from the beginning when only one parameter is modified in the configuration file (e.g. the clustering algorithm), DeltaMP allows re-use of previously produced files in the execution directory through checkpointing.

Checkpointing is turned on by default: if any SUBPROJECT with the same target, the same sequencing technology and the same list of samples is found in the "Path to output location"/PROJECT directory, all configuration parameters and option values are compared.

Different situations can then occur:

The relationship tree among successive checkpointed SUBPROJECTs is documented at config/tree.summary in the output directory of each SUBPROJECT having at least one previous SUBPROJECT.

Checkpointing can be repeated multiple times, as long as at least one parameter differs from each of the previous SUBPROJECTs.

The previous SUBPROJECT can be set explicitly with the -p option; otherwise, the previous SUBPROJECT with the highest number of common steps is taken as the reference.

Checkpointing can be turned off with the -n option.
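A typical checkpointing session might look like this (configuration filenames and the SUBPROJECT name are hypothetical):

# first run: creates a first SUBPROJECT and computes all steps
deltamp configuration_v1.tsv
# second run with one modified parameter: only the differing steps are recomputed
deltamp configuration_v2.tsv
# force a specific SUBPROJECT as reference instead of the automatic choice
deltamp -p MyProject_Illumina_ITS_1234567890 configuration_v2.tsv
# ignore all previous SUBPROJECTs and compute from the beginning
deltamp -n configuration_v2.tsv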

Pipeline analysis steps

Running the deltamp command generates the directories and configuration files necessary to conduct the pipeline analysis, and submits the required steps to the queueing system.

The pipeline steps are bash scripts, available in the bin directory after the build, and named following the scheme "xxx.sh", where xxx is the name of the step.

For 454- or Illumina-specific steps, the script filenames follow the schemes "454_xxx.sh" and "Illumina_xxx.sh", respectively.

Each step's job waits in the queue until its preceding step's job is completed, except for the cut_db step, which does not have to wait for any job completion. With checkpointing, the init step's job waits for the completion of the last common step's job from the previous SUBPROJECT.
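Assuming the build places the step scripts in bin/ as described above, they can be inspected directly, for example:

# list all step scripts, including the 454- and Illumina-specific variants
ls bin/*.sh
# inspect, for example, the init step script
less bin/init.sh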

Outputs

Documentation:

SUBPROJECT.documentation.txt: this file describes the whole processing of the libraries through the pipeline in a human-readable format. It is not archived and is output directly into the SUBPROJECT output directory.

Files in SUBPROJECT.outputs.tar.gz at the end of a full analysis:

Each TAB-separated OTU matrix contains a dense matrix of read counts per OTU in each sample: each row corresponds to an OTU and each column to a sample, the second-to-last column contains the consensus taxonomic assignment (labeled "taxonomy"), and the last column lists the representative sequence identifiers (labeled "repseq"). Taxonomic ranks with no consensus assignment are labeled "unidentified".
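As an illustration, the first lines of such a matrix could look like this (sample names, counts, assignments and sequence identifiers are all invented):

OTU	sample01	sample02	taxonomy	repseq
OTU_1	120	4	Fungi;Ascomycota;unidentified	read00042
OTU_2	0	87	Fungi;Basidiomycota;Agaricomycetes	read01311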

Troubleshooting

Job execution can be monitored with the schedulers' native tools, qstat (Grid Engine) or squeue (SLURM).

If a job is waiting in an error state, or if its dependencies are never satisfied, the job or its preceding job exited due to an error detected inside the job.

If this concerns the quality step, the most common error is a number of reads below the set limit.

For any other job, it means that the previous step terminated unexpectedly or that mothur issued an error in the previous step.

In all these cases, check the standard output (.out) and standard error (.err) log files of the respective step(s) and array(s), located in the "Path to execution location"/SUBPROJECT/log directory.

To detect jobs terminated for exceeding the requested memory and/or time, compare the requested memory/time in the problematic step's script with the maximum memory usage/runtime during job execution. To print a job's record after execution, use qacct (Grid Engine) or sacct (SLURM).
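For example, with a hypothetical job ID 123456:

# SLURM: list your queued/running jobs, then print a finished job's resource usage
squeue -u $USER
sacct -j 123456 --format=JobID,State,Elapsed,MaxRSS
# Grid Engine equivalents
qstat -u $USER
qacct -j 123456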

For unsolved issues, send an email to guillaume.lentendu@unine.ch, including the .out and .err log files as well as the output of the qacct or sacct command for the problematic job.

References