
Nextflow Pipeline for processing Streptococcus pneumoniae sequencing raw reads (FASTQ files) by the GPS Project (Global Pneumococcal Sequencing Project)
GNU General Public License v3.0

GPS Pipeline


The GPS Pipeline is a Nextflow pipeline designed for processing raw reads (FASTQ files) of Streptococcus pneumoniae samples. After preprocessing, the pipeline performs an initial assessment based on the total bases in reads. Samples that pass are further assessed based on assembly, mapping, and taxonomy. If the sample passes all quality controls (QC), the pipeline also provides the sample's serotype, multi-locus sequence typing (MLST), lineage (based on the Global Pneumococcal Sequence Cluster (GPSC)), and antimicrobial resistance (AMR) against multiple antimicrobials.

The pipeline is designed to be easy to set up and use, and is suitable for use on local machines and high-performance computing (HPC) clusters alike. Additionally, the pipeline only downloads essential files to enable the analysis, and no data is uploaded from the local environment, making it an ideal option for cases where the FASTQ files being analysed are confidential. After initialisation or the first successful complete run, the pipeline can be used offline unless you have changed the selection of any database or container image.

The development of this pipeline is part of the GPS Project (Global Pneumococcal Sequencing Project).

 

Table of contents

 

Workflow


 

Usage

Requirements

Software

Hardware

It is recommended to have at least 16GB of RAM and 100GB of free storage

ℹ️ Details on storage

  • The pipeline core files use ~5MB
  • All default databases use ~19GB in total
  • All Docker images use ~13GB in total; alternatively, Singularity images use ~4.5GB in total
  • The pipeline generates ~1.8GB intermediate files for each sample on average
    (These files can be removed when the pipeline run is completed; please refer to Clean Up)
    (To further reduce storage requirements by sacrificing the ability to resume the pipeline, please refer to Experimental)
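For planning purposes, the figures above can be combined into a rough estimate. A minimal sketch in Python (estimated_storage_gb is a hypothetical helper, not part of the pipeline, and the totals are approximations that will vary in practice):

```python
# Rough storage estimate based on the figures listed above
# (approximate; actual usage varies by sample and setup).
CORE_MB = 5
DATABASES_GB = 19
IMAGES_GB = {"docker": 13, "singularity": 4.5}
INTERMEDIATE_GB_PER_SAMPLE = 1.8  # average; removable after the run completes

def estimated_storage_gb(n_samples, engine="docker"):
    """Approximate total storage (GB) needed for a run of n_samples samples."""
    return (
        CORE_MB / 1024
        + DATABASES_GB
        + IMAGES_GB[engine]
        + INTERMEDIATE_GB_PER_SAMPLE * n_samples
    )

print(round(estimated_storage_gb(10, "docker"), 1))       # ~50.0 GB
print(round(estimated_storage_gb(10, "singularity"), 1))  # ~41.5 GB
```

With these figures, the recommended 100GB of free storage comfortably covers a few dozen samples.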

Accepted Inputs

  • Only Illumina paired-end short reads are supported
  • Each sample is expected to be a pair of raw reads following this file name pattern:
  • *_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}
  • example 1: SampleName_R1_001.fastq.gz, SampleName_R2_001.fastq.gz
  • example 2: SampleName_1.fastq.gz, SampleName_2.fastq.gz
  • example 3: SampleName_R1.fq, SampleName_R2.fq
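As an illustration, the file name pattern above can be expressed as a regular expression. A minimal sketch in Python (parse_read_filename is a hypothetical helper for illustration, not part of the pipeline):

```python
import re

# Regular expression equivalent of *_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}
# (a sketch for illustration, not pipeline code)
READ_PATTERN = re.compile(
    r"^(?P<sample>.+)_R?(?P<read>[12])(?:_001)?\.(?:fq|fastq)(?:\.gz)?$"
)

def parse_read_filename(filename):
    """Return (sample_name, read_number), or None if the name does not match."""
    match = READ_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("sample"), int(match.group("read"))

# The examples from the list above:
print(parse_read_filename("SampleName_R1_001.fastq.gz"))  # ('SampleName', 1)
print(parse_read_filename("SampleName_2.fastq.gz"))       # ('SampleName', 2)
print(parse_read_filename("SampleName_R2.fq"))            # ('SampleName', 2)
print(parse_read_filename("reads.bam"))                   # None
```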

Setup

    1. Clone the repository (if Git is installed on your system)
      git clone https://github.com/sanger-bentley-group/gps-pipeline.git

      or

      Download and unzip/extract the [latest release](https://github.com/sanger-bentley-group/gps-pipeline/releases)
    2. Go into the local directory of the pipeline (the directory name might differ); it is ready to use without installation
      cd gps-pipeline
    3. (Optional) Perform an initialisation to download all required additional files and container images, so the pipeline can be used at any time, with or without the Internet, afterwards.

    ⚠️ Docker or Singularity must be running, and an Internet connection is required.

    • Using Docker as the container engine
      ./run_pipeline --init
    • Using Singularity as the container engine
      ./run_pipeline --init -profile singularity

Run

⚠️ Docker or Singularity must be running.

⚠️ If this is the first run and initialisation was not performed, an Internet connection is required.

ℹ️ By default, Docker is used as the container engine and all processes are executed on the local machine. See Profile for details on running the pipeline with Singularity or on an HPC cluster.

  • You can run the pipeline without any option. It will attempt to get the raw reads from the default location (i.e. the input directory inside the gps-pipeline local directory)
    ./run_pipeline
  • You can also specify the location of the raw reads by adding the --reads option
    ./run_pipeline --reads /path/to/raw-reads-directory
  • For a test run, you could obtain a small test dataset by running the included download_test_input script. The dataset will be saved to the test_input directory inside the pipeline local directory. You can then run the pipeline on the test data
    ./download_test_input
    ./run_pipeline --reads test_input
  • 9870_5#52 will fail Taxonomy QC, and hence Overall QC, so it will have no analysis results
  • 17175_7#59 and 21127_1#156 should pass Overall QC and therefore have analysis results

Profile

Resume

Clean Up

Seqera Platform (Optional)

The pipeline is compatible with the Launchpad of Seqera Platform (previously known as Nextflow Tower) and the Nextflow -with-tower option. For more information, please refer to the Seqera Platform documentation.

 

Pipeline Options

Alternative Workflows

| Option | Values | Description |
| --- | --- | --- |
| --init | true or false<br>(Default: false) | Use alternative workflow for initialisation, which means downloading all required additional files and container images, and creating databases.<br>Can be enabled by including --init without a value. |
| --version | true or false<br>(Default: false) | Use alternative workflow for showing versions of the pipeline, container images, tools and databases.<br>Can be enabled by including --version without a value.<br>(This workflow pulls the required container images if they are not yet available locally) |
| --help | true or false<br>(Default: false) | Show help message.<br>Can be enabled by including --help without a value. |

Input and Output

⚠️ --output overwrites existing results in the target directory, if there are any

⚠️ --db does not accept user-provided local databases; the directory content will be overwritten

| Option | Values | Description |
| --- | --- | --- |
| --reads | Any valid path<br>(Default: "$projectDir/input") | Path to the input directory that contains the reads to be processed. |
| --output | Any valid path<br>(Default: "$projectDir/output") | Path to the output directory that saves the results. |
| --db | Any valid path<br>(Default: "$projectDir/databases") | Path to the directory saving databases used by the pipeline. |
| --assembly_publish | "link", "symlink" or "copy"<br>(Default: "link") | Method used by Nextflow to publish the generated assemblies.<br>(The default setting "link" means hard link, and will therefore fail if the output directory is outside the working file system) |
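Since a hard link cannot span file systems, the default "link" publish method only works when the output directory is on the same device as the pipeline's working directory. A minimal sketch of a pre-flight check (same_filesystem is a hypothetical helper, not part of the pipeline):

```python
import os

def same_filesystem(path_a, path_b):
    """True if both paths are on the same device, i.e. hard links are possible."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

# e.g. choose a safe publish method before launching:
# publish_mode = "link" if same_filesystem(project_dir, output_dir) else "copy"
print(same_filesystem(".", "."))  # True: a path is always on its own device
```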

QC Parameters

ℹ️ Read QC does not have directly accessible parameters. The minimum base count in reads for Read QC is the product of --length_low and --depth of Assembly QC (i.e. the default value is 38000000).

| Option | Values | Description |
| --- | --- | --- |
| --spneumo_percentage | Any integer or float value<br>(Default: 60.00) | Minimum S. pneumoniae percentage in reads to pass Taxonomy QC. |
| --non_strep_percentage | Any integer or float value<br>(Default: 2.00) | Maximum non-Streptococcus genus percentage in reads to pass Taxonomy QC. |
| --ref_coverage | Any integer or float value<br>(Default: 60.00) | Minimum reference coverage percentage by the reads to pass Mapping QC. |
| --het_snp_site | Any integer value<br>(Default: 220) | Maximum non-cluster heterozygous SNP (Het-SNP) site count to pass Mapping QC. |
| --contigs | Any integer value<br>(Default: 500) | Maximum contig count in assembly to pass Assembly QC. |
| --length_low | Any integer value<br>(Default: 1900000) | Minimum assembly length to pass Assembly QC. |
| --length_high | Any integer value<br>(Default: 2300000) | Maximum assembly length to pass Assembly QC. |
| --depth | Any integer or float value<br>(Default: 20.00) | Minimum sequencing depth to pass Assembly QC. |
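As noted above, the Read QC threshold is the product of --length_low and --depth. A minimal sketch of the arithmetic (read_qc_min_bases is a hypothetical helper, not part of the pipeline):

```python
def read_qc_min_bases(length_low=1_900_000, depth=20.0):
    """Minimum total bases in reads required to pass Read QC:
    the product of --length_low and --depth from Assembly QC."""
    return length_low * depth

print(int(read_qc_min_bases()))  # 38000000 with the default values
```

Raising either --length_low or --depth therefore also raises the Read QC threshold proportionally.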

Assembly

ℹ️ The output of SPAdes-based assembler is deterministic for a given count of threads. Hence, using --assembler_thread with a specific value can guarantee the generated assemblies will be reproducible for others using the same value.

| Option | Values | Description |
| --- | --- | --- |
| --assembler | "shovill" or "unicycler"<br>(Default: "shovill") | Which SPAdes-based assembler to use for assembling the reads. |
| --assembler_thread | Any integer value<br>(Default: 0) | Number of threads used by the assembler. 0 means all available. |
| --min_contig_length | Any integer value<br>(Default: 500) | Minimum length of contig to be included in the assembly. |

Mapping

| Option | Values | Description |
| --- | --- | --- |
| --ref_genome | Any valid path to a .fa or .fasta file<br>(Default: "$projectDir/data/ATCC_700669_v1.fa") | Path to the reference genome for mapping. |

Taxonomy

| Option | Values | Description |
| --- | --- | --- |
| --kraken2_db_remote | Any valid URL to a Kraken2 database in .tar.gz or .tgz format<br>(Default: Minikraken v1) | URL to a Kraken2 database. |
| --kraken2_memory_mapping | true or false<br>(Default: true) | Whether to use the memory mapping option of Kraken2.<br>true means not loading the database into RAM, suitable for memory-limited or fast-storage environments. |

Serotype

| Option | Values | Description |
| --- | --- | --- |
| --seroba_db_remote | Any valid URL to a SeroBA release in .tar.gz or .tgz format<br>(Default: SeroBA v1.0.7) | URL to a SeroBA release. |
| --seroba_kmer | Any integer value<br>(Default: 71) | Kmer size for creating the KMC database of SeroBA. |

Lineage

| Option | Values | Description |
| --- | --- | --- |
| --poppunk_db_remote | Any valid URL to a PopPUNK database in .tar.gz or .tgz format<br>(Default: GPS v9) | URL to a PopPUNK database. |
| --poppunk_ext_remote | Any valid URL to a PopPUNK external clusters file in .csv format<br>(Default: GPS v9 GPSC Designation) | URL to a PopPUNK external clusters file. |

Other AMR

| Option | Values | Description |
| --- | --- | --- |
| --ariba_ref | Any valid path to a .fa or .fasta file<br>(Default: "$projectDir/data/ariba_ref_sequences.fasta") | Path to the reference sequences for preparing the ARIBA database. |
| --ariba_metadata | Any valid path to a tsv file<br>(Default: "$projectDir/data/ariba_metadata.tsv") | Path to the metadata file for preparing the ARIBA database. |
| --resistance_to_mic | Any valid path to a tsv file<br>(Default: "$projectDir/data/resistance_to_MIC.tsv") | Path to the lookup table mapping resistance phenotypes to MIC (minimum inhibitory concentration). |

Singularity

ℹ️ This section is only valid when Singularity is used as the container engine

| Option | Values | Description |
| --- | --- | --- |
| --singularity_cachedir | Any valid path<br>(Default: "$projectDir/singularity_cache") | Path to the directory where Singularity images are saved. |

Experimental

| Option | Values | Description |
| --- | --- | --- |
| --lite | true or false<br>(Default: false) | ⚠️ Enabling this option breaks the Nextflow resume function.<br>Reduces storage requirements by removing intermediate .sam and .bam files once they are no longer needed while the pipeline is still running.<br>The amount of storage saved cannot be guaranteed.<br>Can be enabled by including --lite without a value. |

Output

Details of results.csv

 

Credits

This project uses open-source components. You can find the homepage or source code of these projects, along with their license information, below. I acknowledge and am grateful to these developers for their contributions to open source.

ARIBA

BCFtools and SAMtools

BWA

Docker Images of ARIBA, BCFtools, BWA, fastp, Kraken 2, mlst, PopPUNK, QUAST, SAMtools, Shovill, Unicycler

Docker Image of network-multitool

Docker Image of Pandas

fastp

GPSC_pipeline_nf

Kraken 2

mecA-HetSites-calculator

mlst

Nextflow

PopPUNK

QUAST

SeroBA

resistanceDatabase

Shovill

SPN-PBP-AMR

Unicycler