The GPS Pipeline is a Nextflow pipeline designed for processing raw reads (FASTQ files) of Streptococcus pneumoniae samples. After preprocessing, the pipeline performs an initial assessment based on the total bases in reads. Samples that pass are further assessed based on assembly, mapping, and taxonomy. If the sample passes all quality controls (QC), the pipeline also provides the sample's serotype, multi-locus sequence typing (MLST), lineage (based on the Global Pneumococcal Sequence Cluster (GPSC)), and antimicrobial resistance (AMR) against multiple antimicrobials.
The pipeline is designed to be easy to set up and use, and is suitable for use on local machines and high-performance computing (HPC) clusters alike. Additionally, the pipeline only downloads essential files to enable the analysis, and no data is uploaded from the local environment, making it an ideal option for cases where the FASTQ files being analysed are confidential. After initialisation or the first successful complete run, the pipeline can be used offline unless you have changed the selection of any database or container image.
The development of this pipeline is part of the GPS Project (Global Pneumococcal Sequencing Project).
ℹ️ For Linux, Docker Engine or Singularity/Apptainer is recommended. Docker Desktop for Linux is known to cause permission issues, which could prevent the pipeline from working.
ℹ️ Make sure you also install `docker-compose-plugin` as per the guide
ℹ️ After installation, you might need to allow Docker to access more system resources, especially CPU and memory, to match the hardware requirements of the pipeline
It is recommended to have at least 16GB of RAM and 100GB of free storage
ℹ️ Details on storage
- The pipeline core files use ~5MB
- All default databases use ~19GB in total
- All Docker images use ~13GB in total; alternatively, Singularity images use ~4.5GB in total
- The pipeline generates ~1.8GB intermediate files for each sample on average
  (These files can be removed when the pipeline run is completed, please refer to Clean Up)
  (To further reduce storage requirement by sacrificing the ability to resume the pipeline, please refer to Experimental)

Accepted Inputs
- Only Illumina paired-end short reads are supported
- Each sample is expected to be a pair of raw reads following this file name pattern: `*_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}`
  - example 1: `SampleName_R1_001.fastq.gz`, `SampleName_R2_001.fastq.gz`
  - example 2: `SampleName_1.fastq.gz`, `SampleName_2.fastq.gz`
  - example 3: `SampleName_R1.fq`, `SampleName_R2.fq`
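
To check a directory listing against the file name pattern before a run, the pattern can be approximated with a regular expression. The following is a minimal sketch, not part of the pipeline: the function name `pair_reads` and the regular expression are illustrative assumptions, and the sample name is assumed to be whatever precedes the `_1`/`_R1` suffix.

```python
import re

# Regex approximation (an assumption) of the glob *_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}
READ_FILE = re.compile(
    r"^(?P<sample>.+)_R?(?P<mate>[12])(?:_001)?\.(?:fq|fastq)(?:\.gz)?$"
)


def pair_reads(file_names):
    """Group file names by sample and keep only samples with both mates."""
    samples = {}
    for name in file_names:
        match = READ_FILE.match(name)
        if match is None:
            continue  # file name does not follow the accepted pattern
        samples.setdefault(match["sample"], {})[match["mate"]] = name
    # A sample is usable only when read 1 and read 2 are both present
    return {s: mates for s, mates in samples.items() if set(mates) == {"1", "2"}}


pairs = pair_reads([
    "SampleName_R1_001.fastq.gz",
    "SampleName_R2_001.fastq.gz",
    "Orphan_1.fastq.gz",  # no matching read 2, so it is dropped
    "notes.txt",          # does not match the pattern
])
print(sorted(pairs))
```

Files that do not match, or samples with only one mate, are silently skipped in this sketch; how the pipeline itself reports such files is not covered here.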
Setup
- Clone the repository (if Git is installed on your system)
git clone https://github.com/sanger-bentley-group/gps-pipeline.git
or
Download and unzip/extract the [latest release](https://github.com/sanger-bentley-group/gps-pipeline/releases)
cd gps-pipeline
⚠️ Docker or Singularity must be running, and an Internet connection is required.
- Using Docker as the container engine
./run_pipeline --init
- Using Singularity as the container engine
./run_pipeline --init -profile singularity
⚠️ Docker or Singularity must be running.
⚠️ If this is the first run and initialisation was not performed, an Internet connection is required.
ℹ️ By default, Docker is used as the container engine and all the processes are executed by the local machine. See Profile for details on running the pipeline with Singularity or on an HPC cluster.
- You can run the pipeline without options. It will attempt to get the raw reads from the default location (i.e. the `input` directory inside the `gps-pipeline` local directory)
./run_pipeline
- You can also specify the location of the raw reads by adding the `--reads` option
./run_pipeline --reads /path/to/raw-reads-directory
- For a test run, you could obtain a small test dataset by running the included `download_test_input` script. The dataset will be saved to the `test_input` directory inside the pipeline local directory. You can then run the pipeline on the test data
./download_test_input
./run_pipeline --reads test_input
  - `9870_5#52` will fail the Taxonomy QC and hence Overall QC, and will therefore have no analysis results
  - `17175_7#59` and `21127_1#156` should pass Overall QC, and will therefore have analysis results

Profile
- Use the `-profile` option to switch to other available profiles
ℹ️ `-profile` is a built-in Nextflow option, so it has only one leading `-`
./run_pipeline -profile [profile name]
Available profiles:

| Profile Name | Details |
| --- | --- |
| `standard` (Default) | Docker is used as the container engine. Processes are executed locally. |
| `singularity` | Singularity is used as the container engine. Processes are executed locally. |
| `lsf` | The pipeline should be launched from an LSF cluster head node with this profile. Singularity is used as the container engine. Processes are submitted to your LSF cluster via `bsub` by the pipeline. (Tested on Wellcome Sanger Institute farm5 LSF cluster only) (The default of `--kraken2_memory_mapping` changes to `false`.) |

- The `-resume` option can be used to resume the pipeline execution instead of starting from scratch again
- Use the same command as the original run and add `-resume` at the end (i.e. all pipeline options should be identical)
ℹ️ `-resume` is a built-in Nextflow option, so it has only one leading `-`
- If the original command is
./run_pipeline --reads /path/to/raw-reads-directory
- The command to resume the pipeline execution should be
./run_pipeline --reads /path/to/raw-reads-directory -resume
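
When the pipeline is launched from a script, the difference between pipeline options (two leading dashes) and built-in Nextflow options (one leading dash) is easy to get wrong. The sketch below is one hypothetical way to assemble the command; `build_command` is an illustrative helper, not something shipped with the pipeline, and it only uses options documented in this README.

```python
import shlex


def build_command(reads=None, output=None, profile=None, resume=False):
    """Assemble a ./run_pipeline argument list from a few documented options."""
    command = ["./run_pipeline"]
    # Pipeline options take two leading dashes
    if reads is not None:
        command += ["--reads", reads]
    if output is not None:
        command += ["--output", output]
    # Built-in Nextflow options take one leading dash
    if profile is not None:
        command += ["-profile", profile]
    if resume:
        command.append("-resume")
    return command


original = build_command(reads="/path/to/raw-reads-directory")
resumed = build_command(reads="/path/to/raw-reads-directory", resume=True)
print(shlex.join(resumed))
# ./run_pipeline --reads /path/to/raw-reads-directory -resume
```

Building the resumed command from the same arguments as the original keeps all pipeline options identical, which is what resuming requires.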

Clean Up
- If the run has been completed and you do not intend to use the `-resume` option or those intermediate files, you can remove the intermediate files using one of the following ways:
- Run the included `clean_pipeline` script
  - It removes the `work` directory and log files within the `gps-pipeline` local directory
./clean_pipeline
- Manually remove the `work` directory and log files within the `gps-pipeline` local directory
rm -rf work
rm -rf .nextflow.log*
- Run the `nextflow clean` command
  - For details on `nextflow clean`, refer to the Nextflow documentation
./nextflow clean
The pipeline is compatible with Launchpad of Seqera Platform (previously known as Nextflow Tower) and the Nextflow `-with-tower` option. For more information, please refer to the Seqera Platform documentation.
./run_pipeline [option] [value]
ℹ️ To permanently change the value of an option, edit the `nextflow.config` file inside the `gps-pipeline` local directory.
ℹ️ `$projectDir` is a Nextflow built-in implicit variable; it is defined as the local directory of `gps-pipeline`.
ℹ️ Pipeline options are not built-in Nextflow options; they start with `--` instead of `-`

| Option | Values | Description |
| --- | --- | --- |
| `--init` | `true` or `false` (Default: `false`) | Use alternative workflow for initialisation, which means downloading all required additional files and container images, and creating databases. Can be enabled by including `--init` without value. |
| `--version` | `true` or `false` (Default: `false`) | Use alternative workflow for showing versions of pipeline, container images, tools and databases. Can be enabled by including `--version` without value. (This workflow pulls the required container images if they are not yet available locally) |
| `--help` | `true` or `false` (Default: `false`) | Show help message. Can be enabled by including `--help` without value. |

⚠️ `--output` overwrites existing results in the target directory if there are any
⚠️ `--db` does not accept user-provided local databases; directory content will be overwritten

| Option | Values | Description |
| --- | --- | --- |
| `--reads` | Any valid path (Default: `"$projectDir/input"`) | Path to the input directory that contains the reads to be processed. |
| `--output` | Any valid path (Default: `"$projectDir/output"`) | Path to the output directory where the results are saved. |
| `--db` | Any valid path (Default: `"$projectDir/databases"`) | Path to the directory where the databases used by the pipeline are saved. |
| `--assembly_publish` | `"link"` or `"symlink"` or `"copy"` (Default: `"link"`) | Method used by Nextflow to publish the generated assemblies. (The default setting `"link"` means hard link, and will therefore fail if the output directory is set to outside of the working file system) |

ℹ️ Read QC does not have directly accessible parameters. The minimum base count in reads of Read QC is based on the multiplication of `--length_low` and `--depth` of Assembly QC (i.e. default value is `38000000`).
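
As a worked example of the note above, the default Read QC threshold follows directly from the two Assembly QC defaults. This is a sketch of the arithmetic only; the function name is illustrative, and truncating the product to an integer is an assumption rather than documented pipeline behaviour.

```python
def read_qc_min_bases(length_low=1900000, depth=20.00):
    """Minimum base count for Read QC: --length_low multiplied by --depth."""
    # int() truncation is an assumption made for this illustration
    return int(length_low * depth)


print(read_qc_min_bases())  # 38000000
```

Raising `--depth` or `--length_low` therefore also raises the Read QC threshold, even though Read QC has no option of its own.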

| Option | Values | Description |
| --- | --- | --- |
| `--spneumo_percentage` | Any integer or float value (Default: `60.00`) | Minimum S. pneumoniae percentage in reads to pass Taxonomy QC. |
| `--non_strep_percentage` | Any integer or float value (Default: `2.00`) | Maximum non-Streptococcus genus percentage in reads to pass Taxonomy QC. |
| `--ref_coverage` | Any integer or float value (Default: `60.00`) | Minimum reference coverage percentage by the reads to pass Mapping QC. |
| `--het_snp_site` | Any integer value (Default: `220`) | Maximum non-cluster heterozygous SNP (Het-SNP) site count to pass Mapping QC. |
| `--contigs` | Any integer value (Default: `500`) | Maximum contig count in assembly to pass Assembly QC. |
| `--length_low` | Any integer value (Default: `1900000`) | Minimum assembly length to pass Assembly QC. |
| `--length_high` | Any integer value (Default: `2300000`) | Maximum assembly length to pass Assembly QC. |
| `--depth` | Any integer or float value (Default: `20.00`) | Minimum sequencing depth to pass Assembly QC. |
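
The thresholds in the table above can be read as three independent checks. The sketch below restates them in Python to make the direction of each comparison explicit; it is not the pipeline's implementation. The class and function names are hypothetical, and the comparisons are assumed to be inclusive, following the ≥ / ≤ wording used in the `results.csv` field descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QcThresholds:
    """Defaults copied from the QC options; field names mirror the option names."""
    spneumo_percentage: float = 60.00
    non_strep_percentage: float = 2.00
    ref_coverage: float = 60.00
    het_snp_site: int = 220
    contigs: int = 500
    length_low: int = 1900000
    length_high: int = 2300000
    depth: float = 20.00


DEFAULTS = QcThresholds()


def taxonomy_qc(spneumo_pct, top_non_strep_pct, t=DEFAULTS):
    # Enough S. pneumoniae reads, and little of the top non-Streptococcus genus
    return (spneumo_pct >= t.spneumo_percentage
            and top_non_strep_pct <= t.non_strep_percentage)


def mapping_qc(ref_cov_pct, het_snp_sites, t=DEFAULTS):
    # Enough of the reference covered, and few non-cluster Het-SNP sites
    return ref_cov_pct >= t.ref_coverage and het_snp_sites <= t.het_snp_site


def assembly_qc(contigs, length, depth, t=DEFAULTS):
    # Not too fragmented, plausible genome length, sufficient depth
    return (contigs <= t.contigs
            and t.length_low <= length <= t.length_high
            and depth >= t.depth)


# Hypothetical sample values, for illustration only
print(assembly_qc(contigs=120, length=2100000, depth=45.0))  # True
print(mapping_qc(ref_cov_pct=85.0, het_snp_sites=221))       # False
```

Passing a modified `QcThresholds` instance mimics overriding an option on the command line, for example `QcThresholds(depth=50.0)` for `--depth 50`.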

ℹ️ The output of a SPAdes-based assembler is deterministic for a given count of threads. Hence, using `--assembler_thread` with a specific value can guarantee the generated assemblies will be reproducible for others using the same value.

| Option | Values | Description |
| --- | --- | --- |
| `--assembler` | `"shovill"` or `"unicycler"` (Default: `"shovill"`) | SPAdes-based assembler used to assemble the reads. |
| `--assembler_thread` | Any integer value (Default: `0`) | Number of threads used by the assembler. `0` means all available. |
| `--min_contig_length` | Any integer value (Default: `500`) | Minimum length of contig to be included in the assembly. |

| Option | Values | Description |
| --- | --- | --- |
| `--ref_genome` | Any valid path to a `.fa` or `.fasta` file (Default: `"$projectDir/data/ATCC_700669_v1.fa"`) | Path to the reference genome for mapping. |

| Option | Values | Description |
| --- | --- | --- |
| `--kraken2_db_remote` | Any valid URL to a Kraken2 database in `.tar.gz` or `.tgz` format (Default: Minikraken v1) | URL to a Kraken2 database. |
| `--kraken2_memory_mapping` | `true` or `false` (Default: `true`) | Using the memory mapping option of Kraken2 or not. `true` means not loading the database into RAM, suitable for memory-limited or fast storage environments. |

| Option | Values | Description |
| --- | --- | --- |
| `--seroba_db_remote` | Any valid URL to a SeroBA release in `.tar.gz` or `.tgz` format (Default: SeroBA v1.0.7) | URL to a SeroBA release. |
| `--seroba_kmer` | Any integer value (Default: `71`) | Kmer size for creating the KMC database of SeroBA. |

| Option | Values | Description |
| --- | --- | --- |
| `--poppunk_db_remote` | Any valid URL to a PopPUNK database in `.tar.gz` or `.tgz` format (Default: GPS v9) | URL to a PopPUNK database. |
| `--poppunk_ext_remote` | Any valid URL to a PopPUNK external clusters file in `.csv` format (Default: GPS v9 GPSC Designation) | URL to a PopPUNK external clusters file. |

| Option | Values | Description |
| --- | --- | --- |
| `--ariba_ref` | Any valid path to a `.fa` or `.fasta` file (Default: `"$projectDir/data/ariba_ref_sequences.fasta"`) | Path to the reference sequences for preparing ARIBA database. |
| `--ariba_metadata` | Any valid path to a `tsv` file (Default: `"$projectDir/data/ariba_metadata.tsv"`) | Path to the metadata file for preparing ARIBA database. |
| `--resistance_to_mic` | Any valid path to a `tsv` file (Default: `"$projectDir/data/resistance_to_MIC.tsv"`) | Path to the resistance phenotypes to MIC (minimum inhibitory concentration) lookup table. |

ℹ️ This section is only valid when Singularity is used as the container engine

| Option | Values | Description |
| --- | --- | --- |
| `--singularity_cachedir` | Any valid path (Default: `"$projectDir/singularity_cache"`) | Path to the directory where Singularity images should be saved to. |

Experimental

| Option | Values | Description |
| --- | --- | --- |
| `--lite` | `true` or `false` (Default: `false`) | ⚠️ Enabling this option breaks the Nextflow resume function. Reduces the storage requirement by removing intermediate `.sam` and `.bam` files once they are no longer needed while the pipeline is still running. The amount of storage saved cannot be guaranteed. Can be enabled by including `--lite` without value. |

- By default, the results are saved to the `output` directory inside the `gps-pipeline` local directory
- The location can be changed with the `--output` option
./run_pipeline --output /path/to/output-directory
The following directories and files are output into the output directory:

| Directory / File | Description |
| --- | --- |
| `assemblies` | This directory contains all assemblies (`.fasta`) generated by the pipeline |
| `results.csv` | This file contains all the information generated by the pipeline on each sample |
| `info.txt` | This file contains information regarding the pipeline and parameters of the run |

results.csv
ℹ️ The output fields of the Other AMR / Virulence types depend on the provided ARIBA reference sequences and metadata file, and the resistance phenotypes to MIC lookup table. The table below is based on the defaults.
ℹ️ The inferred Minimum Inhibitory Concentration (MIC) range of an antimicrobial in "Other AMR" type is only provided if it is included in the resistance phenotypes to MIC lookup table. The default lookup table is based on 2014 CLSI guidelines.
ℹ️ For resistance phenotypes: S = Sensitive/Susceptible; I = Intermediate; R = Resistant
ℹ️ For virulence genes: POS = Positive; NEG = Negative
⚠️ If the result of `Overall_QC` of a sample is `READ_ONE_CORRUPTED`, `READ_TWO_CORRUPTED` or both, the specific read file is found to be corrupted (i.e. incomplete/damaged Gzip file, mismatch(es) between read length and quality-score length). You might want to reacquire the read file from its source, or discard the sample if the source file is corrupted as well.
⚠️ If the result of `Overall_QC` of a sample is `ASSEMBLER FAILURE`, the assembler has crashed when trying to assemble the reads. You might want to re-run the sample with another assembler, or discard the sample if it is a low quality one.
⚠️ If the result of `Serotype` of a sample is `SEROBA FAILURE`, SeroBA has crashed when trying to serotype the sample.

| Field | Type | Description |
| --- | --- | --- |
| `Sample_ID` | Identification | Sample ID based on the raw reads file name |
| `Read_QC` | QC | Read quality control result |
| `Assembly_QC` | QC | Assembly quality control result |
| `Mapping_QC` | QC | Mapping quality control result |
| `Taxonomy_QC` | QC | Taxonomy quality control result |
| `Overall_QC` | QC | Overall quality control result (Based on `Assembly_QC`, `Mapping_QC` and `Taxonomy_QC`) |
| `Bases` | Read | Number of bases in the reads (Default: ≥ 38 Mb to pass Read QC) |
| `Contigs#` | Assembly | Number of contigs in the assembly (Default: ≤ 500 to pass Assembly QC) |
| `Assembly_Length` | Assembly | Total length of the assembly (Default: 1.9 - 2.3 Mb to pass Assembly QC) |
| `Seq_Depth` | Assembly | Sequencing depth of the assembly (Default: ≥ 20x to pass Assembly QC) |
| `Ref_Cov_%` | Mapping | Percentage of reference covered by reads (Default: ≥ 60% to pass Mapping QC) |
| `Het-SNP#` | Mapping | Non-cluster heterozygous SNP (Het-SNP) site count (Default: ≤ 220 to pass Mapping QC) |
| `S.Pneumo_%` | Taxonomy | Percentage of reads assigned to Streptococcus pneumoniae (Default: ≥ 60% to pass Taxonomy QC) |
| `Top_Non-Strep_Genus` | Taxonomy | The most abundant non-Streptococcus genus in reads |
| `Top_Non-Strep_Genus_%` | Taxonomy | Percentage of reads assigned to the most abundant non-Streptococcus genus (Default: ≤ 2% to pass Taxonomy QC) |
| `GPSC` | Lineage | GPSC Lineage |
| `Serotype` | Serotype | Serotype |
| `ST` | MLST | Sequence Type (ST) |
| `aroE` | MLST | Allele ID of aroE |
| `gdh` | MLST | Allele ID of gdh |
| `gki` | MLST | Allele ID of gki |
| `recP` | MLST | Allele ID of recP |
| `spi` | MLST | Allele ID of spi |
| `xpt` | MLST | Allele ID of xpt |
| `ddl` | MLST | Allele ID of ddl |
| `pbp1a` | PBP AMR | Allele ID of pbp1a |
| `pbp2b` | PBP AMR | Allele ID of pbp2b |
| `pbp2x` | PBP AMR | Allele ID of pbp2x |
| `AMO_MIC` | PBP AMR | Estimated minimum inhibitory concentration (MIC) of amoxicillin (AMO) |
| `AMO_Res` | PBP AMR | Inferred resistance phenotype against AMO |
| `CFT_MIC` | PBP AMR | Estimated MIC of ceftriaxone (CFT) |
| `CFT_Res(Meningital)` | PBP AMR | Inferred resistance phenotype against CFT in meningital form |
| `CFT_Res(Non-meningital)` | PBP AMR | Inferred resistance phenotype against CFT in non-meningital form |
| `TAX_MIC` | PBP AMR | Estimated MIC of cefotaxime (TAX) |
| `TAX_Res(Meningital)` | PBP AMR | Inferred resistance phenotype against TAX in meningital form |
| `TAX_Res(Non-meningital)` | PBP AMR | Inferred resistance phenotype against TAX in non-meningital form |
| `CFX_MIC` | PBP AMR | Estimated MIC of cefuroxime (CFX) |
| `CFX_Res` | PBP AMR | Inferred resistance phenotype against CFX |
| `MER_MIC` | PBP AMR | Estimated MIC of meropenem (MER) |
| `MER_Res` | PBP AMR | Inferred resistance phenotype against MER |
| `PEN_MIC` | PBP AMR | Estimated MIC of penicillin (PEN) |
| `PEN_Res(Meningital)` | PBP AMR | Inferred resistance phenotype against PEN in meningital form |
| `PEN_Res(Non-meningital)` | PBP AMR | Inferred resistance phenotype against PEN in non-meningital form |
| `CHL_MIC` | Other AMR | Inferred MIC of Chloramphenicol (CHL) |
| `CHL_Res` | Other AMR | Estimated resistance phenotype against CHL |
| `CHL_Determinant` | Other AMR | Known determinants that estimated the CHL resistance phenotype |
| `CLI_MIC` | Other AMR | Inferred MIC of Clindamycin (CLI) |
| `CLI_Res` | Other AMR | Estimated resistance phenotype against CLI |
| `CLI_Determinant` | Other AMR | Known determinants that estimated the CLI resistance phenotype |
| `COT_MIC` | Other AMR | Inferred MIC of Co-Trimoxazole (COT) |
| `COT_Res` | Other AMR | Estimated resistance phenotype against COT |
| `COT_Determinant` | Other AMR | Known determinants that estimated the COT resistance phenotype |
| `DOX_MIC` | Other AMR | Inferred MIC of Doxycycline (DOX) |
| `DOX_Res` | Other AMR | Estimated resistance phenotype against DOX |
| `DOX_Determinant` | Other AMR | Known determinants that estimated the DOX resistance phenotype |
| `ERY_MIC` | Other AMR | Inferred MIC of Erythromycin (ERY) |
| `ERY_Res` | Other AMR | Estimated resistance phenotype against ERY |
| `ERY_Determinant` | Other AMR | Known determinants that estimated the ERY resistance phenotype |
| `ERY_CLI_Res` | Other AMR | Estimated resistance phenotype against Erythromycin (ERY) and Clindamycin (CLI) |
| `ERY_CLI_Determinant` | Other AMR | Known determinants that estimated the ERY and CLI resistance phenotype |
| `FQ_Res` | Other AMR | Estimated resistance phenotype against Fluoroquinolones (FQ) |
| `FQ_Determinant` | Other AMR | Known determinants that estimated the FQ resistance phenotype |
| `KAN_Res` | Other AMR | Estimated resistance phenotype against Kanamycin (KAN) |
| `KAN_Determinant` | Other AMR | Known determinants that estimated the KAN resistance phenotype |
| `LFX_MIC` | Other AMR | Inferred MIC of Levofloxacin (LFX) |
| `LFX_Res` | Other AMR | Estimated resistance phenotype against LFX |
| `LFX_Determinant` | Other AMR | Known determinants that estimated the LFX resistance phenotype |
| `RIF_MIC` | Other AMR | Inferred MIC of Rifampin (RIF) |
| `RIF_Res` | Other AMR | Estimated resistance phenotype against RIF |
| `RIF_Determinant` | Other AMR | Known determinants that estimated the RIF resistance phenotype |
| `SMX_Res` | Other AMR | Estimated resistance phenotype against Sulfamethoxazole (SMX) |
| `SMX_Determinant` | Other AMR | Known determinants that estimated the SMX resistance phenotype |
| `TET_MIC` | Other AMR | Inferred MIC of Tetracycline (TET) |
| `TET_Res` | Other AMR | Estimated resistance phenotype against TET |
| `TET_Determinant` | Other AMR | Known determinants that estimated the TET resistance phenotype |
| `TMP_Res` | Other AMR | Estimated resistance phenotype against Trimethoprim (TMP) |
| `TMP_Determinant` | Other AMR | Known determinants that estimated the TMP resistance phenotype |
| `VAN_MIC` | Other AMR | Inferred MIC of Vancomycin (VAN) |
| `VAN_Res` | Other AMR | Estimated resistance phenotype against VAN |
| `VAN_Determinant` | Other AMR | Known determinants that estimated the VAN resistance phenotype |
| `PILI1` | Virulence | Expression of PILI-1 |
| `PILI1_Determinant` | Virulence | Known determinants that estimated the PILI-1 expression |
| `PILI2` | Virulence | Expression of PILI-2 |
| `PILI2_Determinant` | Virulence | Known determinants that estimated the PILI-2 expression |
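
For downstream analysis, `results.csv` can be read with any CSV parser using the field names in the table above. The following sketch filters samples by `Overall_QC`. The embedded rows are hypothetical and heavily truncated, and the literal value `PASS` is an assumption: check the values in your own `results.csv` before relying on it.

```python
import csv
import io

# Hypothetical, truncated excerpt; a real results.csv has many more columns
SAMPLE_CSV = """Sample_ID,Overall_QC,Serotype
sample_a,PASS,19F
sample_b,FAIL,
sample_c,ASSEMBLER FAILURE,
"""


def passed_samples(handle, pass_value="PASS"):
    """Return the rows whose Overall_QC equals pass_value (assumed to be "PASS")."""
    return [row for row in csv.DictReader(handle)
            if row["Overall_QC"] == pass_value]


rows = passed_samples(io.StringIO(SAMPLE_CSV))
print([row["Sample_ID"] for row in rows])  # ['sample_a']
```

To read a real file, replace the `io.StringIO` object with `open("output/results.csv", newline="")`, adjusting the path if `--output` was changed.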
This project uses open-source components. You can find the homepage or source code of their open-source projects along with license information below. I acknowledge and am grateful to these developers for their contributions to open source.
- `GET_ARIBA_DB` and `OTHER_RESISTANCE` processes of the `amr.nf` module
- `SAM_TO_SORTED_BAM` and `SNP_CALL` processes of the `mapping.nf` module
- `GET_REF_GENOME_BWA_DB` and `MAPPING` processes of the `mapping.nf` module
- Docker Images of ARIBA, BCFtools, BWA, fastp, Kraken 2, mlst, PopPUNK, QUAST, SAMtools, Shovill, Unicycler
- Docker Image of network-multitool
- `GENERATE_OVERALL_REPORT` process of the `output.nf` module, `HET_SNP_COUNT` process of the `mapping.nf` module and `PARSE_OTHER_RESISTANCE` process of the `amr.nf` module
- `PREPROCESS` process of the `preprocess.nf` module
- `get_lineage.sh` script
- `TAXONOMY` process of the `taxonomy.nf` module
- `het_snp_count.py` script
- `MLST` process of the `mlst.nf` module
- `nextflow` is included in this repository
- `LINEAGE` process of the `lineage.nf` module
- `ASSEMBLY_ASSESS` process of the `assembly.nf` module
- `GET_SEROBA_DB` and `SEROTYPE` processes of the `serotype.nf` module
- `sequences.fasta` is renamed to `ariba_ref_sequences.fasta` and modified
- `metadata.tsv` is renamed to `ariba_metadata.tsv` and modified
- `GET_ARIBA_DB` process of the `amr.nf` module
- `ASSEMBLY_SHOVILL` process of the `assembly.nf` module
- `PBP_RESISTANCE` process of the `amr.nf` module
- `ASSEMBLY_UNICYCLER` process of the `assembly.nf` module