piranha

Poliovirus Investigation Resource Automating Nanopore Haplotype Analysis

This tool is in active development.

Piranha is a tool developed to help standardise and streamline sequencing of poliovirus. It's been developed by members of the Rambaut group at the University of Edinburgh as part of the Poliovirus Sequencing Consortium. Piranha runs an end-to-end read-to-report analysis that produces distributable, interactive reports alongside analysed consensus data. By default, piranha will attempt to generate consensus genomes for populations of the sam It produces an overall report, summarising the entire sequencing run, as well as a sample-specific report. Samples with virus of interest, such as VDPV are highlighted and certain quality-control flags can alert the user if there are issues with the run (such as a failed negative or positive control or identical sequences between samples that may be the result of contamination).

Any issues or feedback about the analysis or report please flag to this repository.

Note: piranha has been tested primarily on poliovirus VP1 sequencing data. There are alternative analysis modes in development (e.g. whole genome, panEV), but the authors recommend additional QC checks if using piranha beyond its established poliovirus VP1 pipeline.

Example report here

Example data here

Installing via PIRANHA GUI

Piranha does not require use of the command-line to run. Users can access piranha through the PiranhaGUI application that can be downloaded at the link below. This application enables teams without access to a bioinformatician to use piranha and run standard detection of poliovirus VP1 in stool through nanopore sequencing.

Download the release package for your machine from the PiranhaGUI respository

Installing within EPI2ME

Piranha is also compatible with the EPI2ME workflow desktop application. This mode of running does not provide the same level of input checking and configurability as either the ARTIFICE GUI or the command line, but is provided as an alternative interactive interface. Once EPI2ME and docker have been downloaded, follow the instructions to download a custom workflow using the github path https://github.com/polio-nanopore/piranha.

Install through bioconda

Installation instructions (quick)

You need to have Git, a version of conda (link to Miniconda here) and mamba installed to run the following commands.

Detailed installation instructions given below, but - in brief - to install with mamba run the following in a terminal:

git clone https://github.com/polio-nanopore/piranha.git 
cd piranha
mamba env create -f environment.yml
conda activate piranha
pip install .

Installation instructions (full walkthrough)

Step 1: Git installation

If this is your first time running piranha, you’ll need to clone the GitHub repository and install it. If you're running on a Mac machine (OS X) and have never used command line tools such as Git before, you may need to install them. They can be installed easily with the installer. Full instructions can be found here. Linux systems will have Git installed already.

Double check you have git installed by typing the following into a terminal window:

git --version

You should see something like the following printed below:

git version 2.21.1 (Apple Git-122.3)

If you don't see this, follow the link to the install instructions above to install git. If you're using a Windows machine, it's possible to use Windows Subsystem for Linux, which should have git pre-installed.

Step 2: Conda\Mamba installation

You will also need to have a version of conda installed. We recommend the latest version of Miniconda, which can be accessed for download and install here. Follow the installation instructions on that page. After successful conda installation and initialistaion you should have (base) at the start of your command prompt before your username like below:

(base) aine$

You can double check your installation has completed successfully by checking the conda version:

(base) aine$ conda --version

This should return a version number. If it does not, repeat the installaion steps outlined here.

Once conda has been installed, we also recommend installing mamba. This can be done using the following command:

(base) aine$ conda install mamba -c conda-forge

Press y to confirm installation if prompted.

Step 3: Choose where to download piranha

I like to keep my repositories in the same place so they’re easy to find, so below I’m making a directory in my home directory (~) called repositories and moving into it with cd, which is short for "change directory".

(base) aine$ cd ~

(base) aine$ mkdir repositories

(base) aine$ cd repositories

Step 4: Clone the piranha repository

Now we’re ready to clone the piranha GitHub repository. This creates a local copy of the piranha repository on your computer. It also retains a link back to the original repository, so if any updates are made (such as addition of new features or bug fixes), it's very easy to pull those changes down to your local machine and update your version of piranha. More information on git cloning can be found here.

Clone the piranha repository with:

(base) aine$ git clone https://github.com/polio-nanopore/piranha.git

(base) aine$ cd piranha

This directory remains linked to the original GitHub copy, so if you need to update it or get any changes in the future, you can do that from this location with the following command:

(base) aine$ git pull

Step 5: Install piranha

We now need to create the piranha environment. Hopefully you have mamba installed, check if you do with the following command:

(base) aine$ mamba --version

If you see the message mamba not found then you should repeat the mamba installation instructions found at the end of step 2 above. Next we need to create the environment where piranha will run by using mamba. This following command needs to be run in the newly cloned piranha repository. To double check you're in the correct directory, you can type pwd (print working directory):

(piranha) aine$ pwd

/localdisk/home/repositories/piranha

If you see a path printed like the one above, ending with piranha, you know you're in the correct directory. If you are not in the correct directory, please navigate to it using the cd command. If you have followed the instructions in step 3 then your piranha repository will be located at ~/repositories/piranha/.

Once you are in the piranha directory you can create the piranha environment with this command:

mamba env create -f environment.yml

Press y to confirm installation if prompted. When completed activate the piranha environment:

(base) aine$ conda activate piranha

(piranha) aine$

Notice that the (base) in your command prompt has changed to (piranha). This tells you that the piranha environment is activated. Now you’ll need run install the piranha python package while you’re in the environment:

(piranha) aine$ pip install .

The . refers to the current working directory (cwd), which should be the piranha repository.

Congratulations! You should now have piranha installed.

Check the install worked

(piranha) aine$ piranha --version

Should return piranha and the version number installed:

piranha v1.0

If no errors have come up (such as messages saying "command not found"), you should now be ready to run piranha!

Installation issue checklist

Sometimes there can be issues unrelated to the commands you've run.

Firstly, check you're in the piranha environment. Your prompt should start with (piranha). If not, activate the piranha environment and check your install again.

Example:

(piranha) aine$ <- correct

(base) aine$ <- incorrect

A common issue can be related to internet connectivity. If a download has failed because of a break in internet, I suggest running through the commands again and just try again. mamba and conda can cache the files already downloaded, so often you can make progress even if internet connectivity is an issue.
Similarly, if you have a laptop with not enough storage space, installation and download can fail. To solve this, try clear some space on your machine and try again.
If you're still having trouble installing via the command line, piranha can be installed using the ARTIFICE GUI (linked above).
Please post an issue to the GitHub if there are further unresolved issues.

Updating piranha

Change directory to the piranha repository: cd piranha
Pull the latest changes from GitHub: git pull
Ensure you're in the piranha environment: conda activate piranha
Install the changes: pip install .

If there has been a major change to piranha, it's possible the environment will need to be updated as follows:

Change directory to the piranha repository: cd piranha
Pull the latest changes from GitHub: git pull
Ensure you're in the piranha environment: conda activate piranha
Update the environment: mamba env update -f environment.yml

Piranha is also available now on bioconda and can be installed with

mamba install -c bioconda piranha-polio

Dependencies If Not Installing Using the Conda Environment

Note: we recommend using piranha in the conda environment specified in the environment.yml file as per the instructions above. If you can't use conda for some reason, dependency details can be found in the environment.yml file.

Quick usage

piranha -i <demultiplexed read directory> -b <path/to/barcodes.csv>

What to put in a barcodes.csv file?

At a minimum, two columns- one for barcode information and one with the name you'd like your sample to be called. This is also where you can flag which samples are negative or positive controls. Barcode and sample names should be unique and sample names shouldn't contain spaces or special characters. This is because the sample name will get incorporated into the output consensus fasta header and spaces in a fasta header disrupt the sequence ID.

Minimal example:

barcode,sample
barcode01,EDI001
barcode02,EDI002
barcode03,EDI003
barcode04,negative
barcode05,positive

You can also include additional information in your barcode.csv. If you include a date column or an EPID column they'll automatically be included in the fasta header output too. Dates should always be in ISO format (YYYY-MM-DD) for metadata best practice. Piranha has a flag --all-metadata-to-header that will take any metadata fields in your barcodes.csv and append them to the final output file (separated by a | pipe symbol). Be aware any odd characters or spaces in these fields will also get added to the fasta header and can interfere with downstream phylogenetics you might want to run (e.g. :,;) can cause issues with some tree-building or reading software.

Ideal example:

barcode,sample,EPID,date
barcode01,EDI001,EPI111,2022-10-10
barcode02,EDI002,EPI112,2022-09-20
barcode03,EDI003,EPI113,2022-09,21
barcode04,negative,,
barcode05,positive,,

What should my reads/read directory look like?

Piranha is configured to make analysis as straightforward as possible for users running MinION sequencing. Piranha takes the output of guppy directly and looks for the demultiplexed read directory containing the barcodes you've specified in your barcodes.csv. Guppy outputs directories in the structure:

fastq_pass/
├── barcode01/
│    ├── FAT75518_pass_barcode01_6d172881_0.fastq.gz
│    ├── FAT75518_pass_barcode01_6d172881_1.fastq.gz
│    └── FAT75518_pass_barcode01_6d172881_2.fastq.gz
├── barcode02/
│    ├── FAT75518_pass_barcode02_6d172881_0.fastq.gz
│    ├── FAT75518_pass_barcode02_6d172881_1.fastq.gz
│    ├── FAT75518_pass_barcode02_6d172881_2.fastq.gz
│    ├── FAT75518_pass_barcode02_6d172881_3.fastq.gz
│    ├── FAT75518_pass_barcode02_6d172881_4.fastq.gz
│    └── FAT75518_pass_barcode02_6d172881_5.fastq.gz
├── barcode03/
│    ├── FAT75518_pass_barcode03_6d172881_0.fastq.gz
│    ├── FAT75518_pass_barcode03_6d172881_1.fastq.gz
│    └── FAT75518_pass_barcode03_6d172881_2.fastq.gz
└── barcode05/
     ├── FAT75518_pass_barcode05_6d172881_0.fastq.gz
     ├── FAT75518_pass_barcode05_6d172881_1.fastq.gz
     └── FAT75518_pass_barcode05_6d172881_2.fastq.gz

This is what piranha will look for. Point the software to the directory containing the different barcodeXX sub-directories and it will iterate within these to find the files. Piranha can accept fastq (fq) or fastq.gz files. It will only attempt to analyse the barcodes present in the input csv file.

Configuration

Input configuration

Piranha has been preconfigured with defaults specific to the VP1 protocol developed by the Polio Sequencing Consortium. All command line arguments (full list below) can be configured either as command line flags when running piranha, or as snakecase arguments in a yaml config file (which can then be supplied with the -c flag). Snakecase formatting separates works with an underscore, i.e. my_argument.

For example, you can supply a custom references file (piranha has a default one supplied, which you can access here) using the -r flag, or by pointing to it within the config file.

Example Case 1: command line argument

(piranha) aine$ piranha -i /localdisk/home/data/minion_run_1/fastq_pass -b /localdisk/home/data/minion_run_1/barcodes.csv -r /localdisk/home/custom_reference_file.fasta

This command shows an example piranha run, where everything is in the default settings except the reference file. In this case you point piranha to the read directory, the barcodes.csv file and your custom reference file.

Example Case 2: config file

Example config file (called config.yaml in current working directory). Actual example file can be found here.

readdir: /localdisk/home/data/minion_run_1/fastq_pass
barcodes_csv: /localdisk/home/data/minion_run_1/barcodes.csv
reference_sequences: /localdisk/home/custom_reference_file.fasta

Then to run piranha you can simply run the command below, and all the information in the file will be included in your run.

(piranha) aine$ piranha -c config.yaml

Flag which samples are controls

Piranha allows you to specify which samples are controls (positive or negative). By default, if the sample name contains negative or positive within the barcode csv file, piranha will automatically detect that these are control samples. See minimal example above for format.

You can overwrite this if you would rather call your controls something else (like nc, my_fave_control etc) with the flags -pc,--positive-control or -nc,--negative-control.

You can have one or more positive and negative controls. If specifying more than one sample they can be supplied as a comma-separated string on the command line, or as a list or comma-separated string in the config file.

Example:

piranha -i path/to/fastq_pass -b barcodes.csv -pc "Pos1,P2" -nc "my negative control"

Alternatively you can supply this in a config file with the fields:

positive_control: "Pos1,P2"
negative_control: Negative1

It is important to make sure that the fields match within barcodes.csv. Also note that above because there are spaces in sample names for the command line example negative control, this command will need quotes around the full name or else the terminal won't interpret it as a single field. ALSO, it's in general better to avoid having spaces in sample names because if you get a consensus sequence out of piranha as a fasta file, record ids are defined as the field up to the first space, so you can lose information in downstream analysis software if you're not careful. Best to just avoid spaces (and also special characters like :, ; and |) in general when dealing with this kind of data that might have phylogenetics run on it.

Samples flagged as controls will appear in the report at the end in a separate table as well and will be flagged as either passing (row in table coloured green and a tick appears) or not passing (row in rable coloured red and no tick appears). Piranha's behaviour treats negative controls as passing if there are fewer than the configured minimum number of reads in the sample (Default: 50 reads) and positive controls as passing if there is more than the minimum number of reads in the sample for the positive reference (Default: 50 reads).

Positive control reference

The positive control sequence can be configured. By default it is set to a a CoxsackievirusA20 reference sequence present in the references FASTA file supplied with piranha (CoxsackievirusA20_AF499642). The Poliovirus Sequencing Consortium has developed a positive CoxA20 control, which can be supplied to user labs upon request. If a different positive control is being used within the sequencing run, this can be configured with the--positive-references flag. The user can specify one or more references to match as a positive control, however they must be present in the references file supplied to piranha. A custom references file can be supplied alongside the run if the positive control used doesn't have a close match in the piranha reference database. Note however, that if your positive control is supplied within the file and does have an existing close match within the supplied piranha reference file, the existence of two very close sequences within a FASTA file may lead to competition for minimap2, and result in a reduced mapping quality score.

If you have a reference positive control sequence that is not closely matched with anything already existing in the piranha database and can share the sequence, we would be happy to include it in the distributed file. In this way, piranha could match this sequence for the positive control without the need to supply a custom reference file.

Matching reads against the reference file

Piranha takes nanopore sequencing reads (fastq) and matches them against a reference set of enterovirus VP1 sequences. This matching is currently performed with minimap2 (Li 2018), run in the mode configured for noisynanopore data (-x map-ont). For more information on the details of what this pre-configured setting is, visit this link. In addition to these pre-configured settings, within piranha we currently only consider primary alignments (i.e. the best hit) for each read against the reference file, rather than all potential matches.

This reference set includes representatives of wild-type poliovirus 1, 2 and 3, as well as Sabin-1, 2 and 3 as well as a selection of other enterovirus VP1 sequences too.

IMPORTANT: Nanopore reads can have up to ~10% error rate and sporadic mapping to some references can occur at the read level, particularly when sequencing at great depth. When looking at the output read counts in the piranha report, this must be interpreted with caution and awareness of the underlying noise in the data. For example, low numbers of reads may map to WPV2 or WPV3 as they are present in the default database. This DOES NOT mean that WPV2 or WPV3 is present in your sample. Hits with greater than the minimun read count are highlighted in the report in the sample composition table (Table 2). If the read count passes the minimum thresholds (number of reads and percentage of sample) the reads will be taken forward for consensus building. It is only possible at the consensus level to get an accurate picture of that read population in your sample.

Read thresholds

There are a number of default thresholds applied, which can be overwritten depending on your purposes.

-n,--min-read-length
-x, --max-read-length

The default read length range filters accepts reads between 1000 and 1300 nucleotide bases in length.

-d, --min-read-depth
-p, --min-read-pcent

These parameters set the minimum number of reads hitting a particular reference in the reference file (and the minimum percentage of reads within the sample) that are necessary to create a binned read group and attempt to make a consensus sequence for that particular sample. By default a minimum of 50 reads are necessary to build a consensus sequence and a minimum of 2% of the sample is required to be represented by that particular reference before it will attempt to create a consensus for this.

Similarly, mapping quality and alignment block length are used as quality filters within piranha to minimise the risk of any false positive results. By default, the minimum mapping quality is set to 15, on a phred scale of 0 to 60. Note that if you are using a non-default references FASTA file, having identical or near-identical sequences in the references file this may superficially decrease the mapping quality as there is no clear individual hit in the file. It is for this reason we only include a single Sabin reference for individual poliovirus types and no additional VDPVs.

The length of the alignment block is also used to filter out sporadic, incorrect hits. The minimum block length is 0.6 (or 60%) of the minimum read length by default, so an alignment block length of 600 or more.

-q, --min-mapping-quality
-a, --min-aln-block

How many reads should I count as a signal?

We have set the minimum read depth to be 50 reads in order to attempt to make a consensus. Within piranha, we run minimap2 to map reads against the background reference panel. The top hit within the background reference panel is reported, by default showing the "ddns_group" field. The categories displayed are:

Sabin1-Related
Sabin2-Related
Sabin3-Related
WPV1
WPV2
WPV3
NonPolioEV
Unmapped
PositiveControl (if a positive control has been used)

When sequencing samples at high depth, using mapping software on the raw nanopore reads (which are error prone) can lead to a certain level of noise. Hits above the minimum read depth threshold (Default >50) are highlighted in red in the final report. If the population of reads mapping to a particular reference successfully makes a consensus sequence at the end of the piranha pipeline, this is an indication of a genuine population of reads rather than noise.

Importantly, the reference that is hit within the background references file does not fully indicate what consensus sequence will be generated from the read population. Further phylogenetic analysis should be performed to confirm the identity of the sequence that you get.

Custom reference file

Users can specify their own reference file or by default piranha will access the reference file packaged with the software.

The reference file must be in fasta format and, within that file, the first field must be the reference ID (without spaces). There must also be ddns_group specified in the header field. This reference file can include additional information in the following format:

>Poliovirus3-wt_JN812657 ddns_group=WPV3 species=Poliovirus3-wt 
GGGGTGGACGATCTGATAACAGAA...
>Poliovirus3-Sabin_AY184221 ddns_group=Sabin3-related species=Sabin3-related
GGTATTGAAGATTTGACTTCTGAA...

Notably, ddns_group allows multiple references to be aggregated together/ anonymised within the final report. Current compatibility within piranha allows ddns_group fields to include: WPV1,WPV2,WPV3,Sabin1-related,Sabin2-related,Sabin3-related and NonPolioEV.

Initial processing pipeline

Gathers all read files for a given barcode together
Filters these reads by length (Default only reads between 1000 and 1300 bases are included for further analysis)
The reads are mapped against a panel of references. By default there is a reference panel included as part of piranha. It includes VP1 sequences from various wild-type polio viruses, reference Sabin-1, Sabin-2 and Sabin-3 sequences for identification of Sabin-like or VDPV polioviruses, and also a number of non-polio enterovirus reference seqeuences. A custom VP1 fasta file can be supplied with the -r flag.
The read map files are parsed to assign each read to the closest virus sequence in the reference panel. This assigns each read a broad category of either Sabin-1 like, Sabin-2 like, Sabin-3 like, wild-type Poliovirus (1, 2, or 3), non-polio enterovirus, or unmapped.
These broad category assignments are used to bin reads for further downstream analysis. Any bin with greater than the minimum read threshold (Default 50 reads, but can be customised) and minimum read percentage (default 10% of sample, but can be customised) is written out in a separate fastq file which will be used to generate the broad-category consensus sequence.

Analysis for each mapped read bin

For each bin, a consensus sequence is generated using medaka and variation information is calculated for each site in the alignment against the reference. This calculates the consensus variants within each sample.
The variants that are flagged by medaka are assessed for read co-occurance to tease apart variant haplotypes within the sample.
For the entire run, and for each individual barcode/ sample, an interactive html report is generated summarising the information.

Configuring analysis options

Piranha is configured by default to work with the nested amplification VP1 protocol developed by the Polio Sequencing Consortium. If you're testing another protocol or want to modify the default analysis behaviour you can configure thresholds and models with various command line arguments (the snake case of the long-form arguments can be supplied in an input config file, see example here).

`--medaka-model`

Piranha uses medaka to generate consensus sequences. Medaka is software developed by Oxford nanopore technologies that explicitly deals with the error profile of nanopore data. Unlike Illumina sequencing, the error profile in sequencing reads generated by nanopore technology is not randomly distributed. Areas of low complexity, repetitive regions and particularly homopolymeric runs are lower in accuracy. Software like medaka (instead of traditional variant calling and consensus generation software) uses machine learning methods to compensate for this non-random error profile and for a long time the leading variant calling software for nanopore sequencing has used machine learning methods (e.g. nanopolish and medaka). For detailed information about the internal workings of medaka, see this helpful tutorial.

By default the medaka model run is r941_min_high_g360. You should ensure to run the appropriate model for your data. The format of medaka model names is of:

{pore}_{device}_{caller variant}_{caller version}

so the default model in use assumes you've run

a R9.4.1 flow cell
on a MinION (or also for GridION can leave it set to min- only need to change min to prom if it was a PromethION run)
in high accuracy mode (if you've run in fast mode this should be changed)
with Guppy version 3.6.0

To see the available list of medaka models installed you can use the piranha command

(piranha) aine$ piranha --medaka-list-models

and piranha will check which ones you have installed with your version of medaka, print them to screen and exit. Medaka may lag behind the latest models of guppy available, so use the closest model you can to what you're running but also be aware that the exact model for your version of Guppy may not yet exist if you're on the cutting edge of Guppy versions. Conversely, if you are running a very old version there may also not be an exact medaka model for your data. The developers of medaka suggest to run the correct model for best results. They also state:

Where a version of Guppy has been used without an exactly corresponding medaka model, the medaka model with the highest version equal to or less than the guppy version should be selected.

Process enterovirus sequences NEW FEATURE

Piranha now has the flexibility to cluster sequencing reads by a specified reference group field (-rg / --reference-group-field, or reference_group_field in a config file). By default, this field is set to ddns_group and any records in the reference file must contain this as an annotation in the header description. The reference file that is supplied with piranha contains this field in all records and the values fall into the following categories:

Sabin1-related
Sabin2-related
Sabin3-related
WPV1
WPV2
WPV3
NonPolioEV

These fields focus on poliovirus detection and all non-polio enterovirus sequences are collaposed into a single category.

If your use of piranha extends to other enteroviruses, it is now possible to run a more tailored analysis. The reference file that is supplied with piranha has the following format:

>Poliovirus1-Sabin_AY184219 ddns_group=Sabin1-related species=Sabin1-related cluster=Sabin1-related
GGGTTAG...CCACATAT
>CoxsackievirusA21_EF015028 ddns_group=NonPolioEV species=CoxsackievirusA21 cluster=CoxsackievirusA21
AGATGCA...CAGTGAGA

With the -rg flag, you can use the reference file supplied to cluster by ddns_group (default), species or cluster. Then, piranha will dynamically calculate which categories are present during the run. For example, running piranha with -rg species will enable clusters of CoxsackievirusA21, or other enterovirus species to be identified and have a consensus sequence generated during piranha analysis.

It is also possible to supply a custom reference file with references and fields of interest to piranha in the following format:

>REF1 my_custom_field=group1
ACTGT....CTAGCTA
>REF2 my_custom_field=group1
ACTGT....CTAGCTA
>REF3 my_custom_field=group1
ACTGT....CTAGCTA
>REF4 my_custom_field=group2
ACTGT....CTAGCTA
>REF5 my_custom_field=group2
ACTGT....CTAGCTA

and in this example, if -rg my_custom_field is supplied, any reads that map to REF1, REF2 or REF3 will be clustered into 'group1' and any reads that map to REF4 or REF5 will be clustered into 'group2'.

Any values will reads mapping will be shown in the final reports.

Pitfalls to note:

A caveat with this new approach is that our reference file has been constructed with poliovirus primarily in mind. There are quite a few enterovirus references in the file, covering 113 species groups, but there is likely to be gaps in the broad enterovirus diversity covered in this reference file. If you have species of interest you want to be detected, it may be worth supplying a custom reference file.
Enteroviruses are a very diverse group of viruses, it may well be that portions of a read match to something in our reference set, but the rest is too divergent to map. In this case, lowering the alignment block length minimum requirement may enable partial sequences to be recovered. Here, the limitation is the reference-based approach and supplementing the reference file with additional reference sequences may resolve this.
Mapping against a reference set that contains multiple sequences that are very similar may significantly decrease the alignment quality score from minimap2. This case arises when there are 2 equally likely matches in the database to your sequencing reaad and the software cannot confidently call the best hit. In the default reference set, we only include a single Sabin1, Sabin2 and Sabin3 reference respectively, as including VDPV sequences in the database may confuse the mapping software. By default, reads with low alignment mapping scores will be filtered out. By lowering the mapping quality threshold, these split cases may be included in the analysis for enteroviruses.

Suggested configuration settings can be found here and we would recommend to only use this mode if the user can sufficiently appreciate the limitations the reference file composition may have on detection.

With the supplied config file, piranha can be run as follows:

piranha -c config.ev.yaml

Experimental haplotype calling pipeline NEW FEATURE

There is now an experimental haplotype calling pipeline that uses freebayes for initial variant calling and flopp for read phasing with the called variants. It has a number internal QC steps for merging identical haplotypes. This pipeline requires further validation, but can theoretically produce multiple consensus sequences for each poliovirus population present within a sample. The pipeline has mostly be tested on VDPVs of two mixtures, but will be further assessed.

To try out the novel haplotype calling pipeline use the --run-haplotyping flag.

Optional phylogenetics module NEW FEATURE

Piranha allows the user to optionally run a phylogenetics module in addition to variant calling and consensus builing. If you have previously installed piranha, are 3 additional dependencies needed if you wish to run this module:

IQTREE2
mafft
jclusterfunk The latest environment file contains this dependencies, so to install you can just update your environment (conda env update -f environment.yml) or run using the latest image for piranha GUI.

This module will cluster any consensus sequences generated during the run into reference_group, so either Sabin1-related, Sabin2-related, Sabin3-related or WPV1 and will ultimately build one maximum-likelihood phylogeny for each reference group with consensus sequences in a given sequencing run. To annotate the phylogeny with certain metadata from the barcodes.csv file, specify columns to include with -pcol/--phylo-metadata-columns.

Piranha then extracts any relevant reference sequences from the installed reference file (identified by having ddns_group=Sabin1-related in their sequence header, or whichever reference group the relevant phylogeny will be for).

An optional set of local sequences can be supplied to supplement the phylogenetic analysis. To supply them to piranha, point to the correct directory using -sd,--supplementary-datadir. The sequence files should be in FASTA format, but do not need to be aligned. To allow piranha to assign the sequences to the relevant phylogeny, the sequence files should have the reference group annotated in the header in the format ddns_group=Sabin1-related, for example.

This supplementary sequence files can be accompanied with csv metadata files (one row per supplementary sequence) and this metadata can be included in the final report and annotated onto the phylogenies (-smcol/--supplementary-metadata-columns). By default, the metadata is matched to the FASTA sequence name with a column titled sequence_name but this header name can be configured by specifying -smid/--supplementary-metadata-id-column.

Piranha will iterate accross the directory supplied and amalgamate the FASTA files, retaining any sequences with ddns_group=X in the header description, where X can be one of Sabin1-related, Sabin2-related, Sabin3-related or WPV1. It then will read in every csv file it detects in this directory and attempts to match any metadata to the gathered fasta records. These will be added to the relevant phylogenies.

The phylogenetic pipeline is activated by running with the flag -rp/--run-phylo, which then triggers the following analysis steps:

Amalgamate the newly generated consensus sequences for all barcodes into their respective reference groups.
Combine with the relevant Sabin or wild-type poliovirus references.
Combine in any relevant supplementary sequences provided.
Add in an outgroup sequence for each reference group phylogeny.
Determine what metadata is provided and which columns should be used for tree annotation (this will allow the user to colour the tips in the report by certain metadata).
For each reference group, align the combined FASTA file using mafft.
Build a phylogeny using IQTREE2 with a HKY substitution model and zero branches collapsed, with the specified outgroup sequence.
Prune off the outgroup sequences using jclusterfunk.
Annotate the tree newick files with the specified metadata (Default: just whether it's a new consensus sequence or not).
Extract phylogenetic trees and embed in interactive report.

Update local database NEW FEATURE

If you supply a path to the -sd,--supplementary-datadir for the phylogenetics module, you have the option of updating this data directory with the new consesnsus sequences generated during the piranaha analysis. If you run with the -ud,--update-local-database flag, piranha will write out the new sequences and any accompanying metadata supplied into the directory provided.

The files written out will be in the format runname.today.fasta and runname.today.csv. For example, if your runname supplied is MIN001 and today's date is 2023-11-05, the files written will be:

MIN001.2023-11-05.fasta
MIN001.2023-11-05.csv with the newly generated consensus sequences and accompanying metadata from that run.

Note: if supplying the supplementary directory to piranha on a subsequent run, your updated local database will be included in the phylogenetics. However, piranha will ignore any files with identical runname.today patterns to the active run. So, if your current run would produce files called MIN001.2023-11-05.fasta and MIN001.2023-11-05.csv, if those files already exist in the supplementary data directory, they will be ignored. This is to avoid conflicts if piranha is run multiple times on the same data.

Output options

By default the output directory will be created in the current working directory and will be named analysis-YYYY-MM-DD, where YYYY-MM-DD is today's date. This output can be configured in a number of ways. For example, the prefix analysis can be overwritten by using the -pre/--output-prefix new_prefix flag (or output_prefix: new_prefix in a config file) and this will change the default behaviour to new_prefix_YYYY-MM-DD. It's good practice not to include spaces or special characters in your directory names.

To completely overwrite the output directory name (rather than just the prefix), you can run the -o/--outdir flag (or supply outdir in the config file).

If you're running multiple analyses from the same directory and not supplying new directory names, piranha will append an incrementing number to the end of the directory name so that the contents within the previous run output don't cause conflicts within the new run. To change this default behaviour and overwrite the previous directory, use the --overwrite flag. Note: this will wipe all the contents within the analysis-YYYY-MM-DD directory and re-populate it with the output of the new piranha run.

Temporary files

By default, piranha removes the intermediate files it produces during the analysis run. If you want to access the intermediate files (for example, the bam files, the read bins etc.) run with --no-temp and all intermediate files will be kept.

By default temporary files are stored in $TMPDIR, which then gets wiped when the process completes. If you want to store temp files elsewhere this can be done with --tempdir.

Archiving sequencing reads

Piranha can handle archiving the read directory supplied for analysis. If the --archive-fastq flag is supplied to piranha (or archive_fastq in a config file), piranha will copy the directory tree (so all files and sub-directories) in the readdir path to the output directory under the name "fastq_pass".

If you want to customise where piranha copies the sequencing read directory to, supply the --archivedir argument with the path you wish to write the reads to and it will overwrite the default option to write the reads to the output directory.

Customising the report

The output report can include some information about the sequencing run, the name of the individual running the report and the institute doing the sequencing. To give the report a specific run name (rather than the default title Nanopore sequencing report) supply the new name with the command line flag --runname or as runname: in the config file. Similarly, to enable the report to display the name of the user and institute, provide --username and --institute (or within the config file).

A free-text --notes field can be supplied that will get embedded into the top of the html report.

The orientation of barcodes in the vizualisation representing a 96-well plate can be configured. By default the barcodes increment from 1 to 96 in a vertical orientation (down each column in turn). By using the --orientation flag on the command line or orientation key in a configuration file, either the default vertical or horizontal (which will increment the numbers from 1-96 across each row in turn) can be specified. If well is supplied as a column in the barcode.csv file, this default orientation will be overwritten. Format for well specification is as a standard 96-well plate: A01 to H12.

Output language

The reports are available in English (default) and in French.

Output report description

Header

The header includes the title of the run (supply with --runname) and the date of report generation.

User and Institute

Optionally the report can also display user and institute information beneath the header (--username and --institute flags).

Table 1 | Sample summary information

This table gives summary information about which viruses were identified within each sample and, for Sabin-related polioviruses, the number of mutations away from Sabin. It also has a link that allows you to download a particular consensus sequence. The rows can be sorted by each column in either ascending or descending order by clicking on the column header. By default, the table is sorted by sample name. It's possible to search the table by typing in the text box on the right-hand side. Clicking on the sample name within the table will redirect the user to the individual sample reports. Be aware this link works in situ, when path to the sample report relative to the main output report has not changed, but if the report is moved or distributed without the barcode reports, this link will no longer function.

Under the table header, there's a dropdown menu that gives options to export the displayed table (either by copying to the clipboard, as a csv or directly printing it). By clicking on rows within the table, it's possible to select the subset of rows you're interested in exporting and this is what will be sent to the clipboard, file or printer.

To the right, there is a button to download the detailed table. This additional table has a compiled set of data from the sequencing run. By default the information provided in this this file are as follows.

First columns:

sample, barcode, EPID, institute,...

Then any additional fields supplied in the barcodes.csv for each sample (e.g. date of collection, date of sequencing)

Then information about each reference group in turn. For example, Sabin1-related:

Sabin1-related|closest_reference,Sabin1-related|num_reads,Sabin1-related|nt_diff_from_reference,Sabin1-related|pcent_match,Sabin1-related|classification,...

Finally a generic comments column that can be modified later (or prepopulated if provided in the barcodes.csv).

An example of this detailed file can be found here.

Table 2 | Composition of samples

This table displays read counts for each sample that have mapped to each of the reference groups. Similar to Table 1, this table can be sorted, searched and exported.

Table 3 | Flagged samples

This table will only appear if there are identical sequences in your sequencing run. It flags samples that contain consensus sequences that are completely identical. This may be a sign of contamination within the sequencing run, but doesn't necessarily mean this. This table serves as a prompt to investigate why these sequences may be identical.

Table 4 | Controls

If a negative and/or positive control are included in your sequencing run (piranha will automatically detect them if their sample name is negative or positive, or the user can specify the name of their controls with the command line flags or by providing them in a config file). If the control "passes" (i.e. has fewer than 50 reads for the negative control or has more than 100 reads in the NonPolioEV category for the positive control), the row will be coloured green and have a tick mark under the "Pass" column. If the controls fail, the row will be coloured red. If the controls fail, this may be an indication of a failed sequencing run.

Sample report

This report gives additional information specific to the sequences generated for each sample.

Summary of sample content

This table summarises the reference groups found within the sample. Clicking on the reference group within the table will link the user down to the appropriate section of the report.

VP1 sequences

All consensus sequences generated for this sample are available within this box in fasta format. The header of the fasta file by default is of the format:

>sample|barcode|reference_group|reference|number_nt_diffs_from_reference|variants
AAGTCAGTCATCAGCTGACT...CAGTGATCGAGCTAGTAT

Additional information can be detected and appended to this fasta header. By default, if a date column or EPID column is provided in the barcodes.csv file, this will be appended to the header. Piranha has a flag --all-metadata-to-header that will append all metadata fields provided in the input csv file to the header.

This fasta file can be highlighted and copied to the clipboard, or accessed as part of the "published_data" directory in the output directory of piranha.

Reference group specific report

This section of the report exists for each reference group identified within the sample.

Table | Reference group table

Firstly, a table summarises the number of mutations and what mutations have been identified relative to the closest reference.

Figure | Variation (errors + mutations) across reference

The variation figure shows the noise within the data in contrast to the variants that get called within the sample. Each point represents the percentage alternative allele present at that site in the VP1 sequence. The baseline reference for Sabin-related sequences is the relevant Sabin reference, whereas for the other reference groups (wild-type poliovirus and non-polio enterovirus) the baseline reference is the consensus sequence generated. This means that for Sabin-related sequences, the differences relative to Sabin are highlighted and for the other read populations within the sample, differences within sample relative to the consensus of that read population are highlighted. The SNPs called by Medaka relative to the chosen baseline reference (either Sabin or consensus sequence) are highlighted in dark green and insertion-deletion mutations (if any) are highlighted in yellow. Masked variants are highlighted in red. A variant will be masked if it will cause a frameshift within the VP1 region (like an indel whose length is not a multiple of 3) or if it is part of a string of variants in very close proximity (highlighting an issue around that area).

Figure | snipit plot

This snipit plot (https://github.com/aineniamh/snipit) highlights the consensus differences relative to the reference.

Figure | Percentage co-occurrence of SNPs called against reference

This plot calculates the percentage of reads that share the variants called against the reference, and can give a noisy estimate of haplotypes present within the sample. The calculation only takes bases of high-quality into account, so counts reflect the level of support rather than the level of coverage. For each variant, the percentage displayed is a function of the total high quality bases at that site, and for each variant the cooccurance is a percentage of high quality bases at each site in turn that share both variants. This figure can give an idea if mixed populations are present within the sample.

Full usage

usage: 
    piranha -c <config.yaml> [options]
    piranha -b <barcodes.csv> -i <demultiplexed fastq_dir> [options]

Input options:
  -c CONFIG, --config CONFIG
                        Input config file in yaml format, all command line arguments can be passed via the config file.
  -i READDIR, --readdir READDIR
                        Path to the directory containing fastq read files
  -b BARCODES_CSV, --barcodes-csv BARCODES_CSV
                        CSV file describing which barcodes were used on which sample
  -r REFERENCE_SEQUENCES, --reference-sequences REFERENCE_SEQUENCES
                        Custom reference sequences file.
  -rg REFERENCE_GROUP_FIELD, --reference-group-field REFERENCE_GROUP_FIELD
                        Specify reference description field to group references by. Default: `ddns_group`
  -nc NEGATIVE_CONTROL, --negative-control NEGATIVE_CONTROL
                        Sample name of negative control. If multiple samples, supply as comma-separated string of sample names. E.g.
                        `sample01,sample02` Default: `negative`
  -pc POSITIVE_CONTROL, --positive-control POSITIVE_CONTROL
                        Sample name of positive control. If multiple samples, supply as comma-separated string of sample names. E.g.
                        `sample01,sample02`. Default: `positive`
  -pr POSITIVE_REFERENCES, --positive-references POSITIVE_REFERENCES
                        Comma separated string of sequences in the reference file to class as positive control sequences.

Analysis options:
  -s SAMPLE_TYPE, --sample-type SAMPLE_TYPE
                        Specify sample type. Options: `stool`, `environmental`. Default: `stool`
  -m ANALYSIS_MODE, --analysis-mode ANALYSIS_MODE
                        Specify analysis mode to run. Options: `vp1`,`panev`, `wg`. Default: `vp1`
  --medaka-model MEDAKA_MODEL
                        Medaka model to run analysis using. Default: r941_min_hac_variant_g507
  --medaka-list-models  List available medaka models and exit.
  -q MIN_MAP_QUALITY, --min-map-quality MIN_MAP_QUALITY
                        Minimum mapping quality. Default: 15
  -n MIN_READ_LENGTH, --min-read-length MIN_READ_LENGTH
                        Minimum read length. Default: 1000
  -x MAX_READ_LENGTH, --max-read-length MAX_READ_LENGTH
                        Maximum read length. Default: 1300
  -d MIN_READ_DEPTH, --min-read-depth MIN_READ_DEPTH
                        Minimum read depth required for consensus generation. Default: 50
  -p MIN_READ_PCENT, --min-read-pcent MIN_READ_PCENT
                        Minimum percentage of sample required for consensus generation. Default: 2
  -a MIN_ALN_BLOCK, --min-aln-block MIN_ALN_BLOCK
                        Minimum alignment block length. Default: 0.6*MIN_READ_LENGTH
  --primer-length PRIMER_LENGTH
                        Length of primer sequences to trim off start and end of reads. Default: 30

Haplotyping options:
  -rh, --run-haplotyping
                        Trigger the optional haplotyping module. Additional dependencies may need to be installed.
  -hs HAPLOTYPE_SAMPLE_SIZE, --haplotype-sample-size HAPLOTYPE_SAMPLE_SIZE
                        Number of reads to downsample to for haplotype calling. Default: 3000
  -hf MIN_ALLELE_FREQUENCY, --min-allele-frequency MIN_ALLELE_FREQUENCY
                        Minimum allele frequency to call. Note: setting this below 0.07 may significantly increase run time. Default: 0.07
  -hx MAX_HAPLOTYPES, --max-haplotypes MAX_HAPLOTYPES
                        Maximum number of haplotypes callable within reference group. Default: 4
  -hdist MIN_HAPLOTYPE_DISTANCE, --min-haplotype-distance MIN_HAPLOTYPE_DISTANCE
                        Minimum number of SNPs between haplotypes. Default: 2
  -hd MIN_HAPLOTYPE_DEPTH, --min-haplotype-depth MIN_HAPLOTYPE_DEPTH
                        Minimum number of reads in a given haplotype. Default: 20

Phylogenetics options:
  -rp, --run-phylo      Trigger the optional phylogenetics module. Additional dependencies may need to be installed.
  -sd SUPPLEMENTARY_DATADIR, --supplementary-datadir SUPPLEMENTARY_DATADIR
                        Path to directory containing supplementary sequence FASTA file and optional metadata to be incorporated into phylogenetic
                        analysis.
  -pcol PHYLO_METADATA_COLUMNS, --phylo-metadata-columns PHYLO_METADATA_COLUMNS
                        Columns in the barcodes.csv file to annotate the phylogeny with. Default: ['call', 'sample_date', 'EPID']
  -smcol SUPPLEMENTARY_METADATA_COLUMNS, --supplementary-metadata-columns SUPPLEMENTARY_METADATA_COLUMNS
                        Columns in the supplementary metadata to annotate the phylogeny with. Default: ['location', 'lineage']
  -smid SUPPLEMENTARY_METADATA_ID_COLUMN, --supplementary-metadata-id-column SUPPLEMENTARY_METADATA_ID_COLUMN
                        Column in the supplementary metadata files to match with the supplementary sequences. Default: sequence_name
  -ud, --update-local-database
                        Add newly generated consensus sequences (with a distance greater than a threshold (--local-database-threshold) away from
                        Sabin, if Sabin-related) and associated metadata to the supplementary data directory.
  -dt, --local-database-threshold
                        The threshold beyond which Sabin-related sequences are added to the supplementary data directory if update local database
                        flag used. Default: 6

Output options:
  -o OUTDIR, --outdir OUTDIR
                        Output directory. Default: `analysis-2022-XX-YY`
  -pub PUBLISHDIR, --publishdir PUBLISHDIR
                        Output publish directory. Default: `analysis-2022-XX-YY`
  -pre OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
                        Prefix of output directory & report name: Default: `analysis`
  --datestamp DATESTAMP
                        Append datestamp to directory name when using <-o/--outdir>. Default: <-o/--outdir> without a datestamp
  --overwrite           Overwrite output directory. Default: append an incrementing number if <-o/--outdir> already exists
  -temp TEMPDIR, --tempdir TEMPDIR
                        Specify where you want the temp stuff to go. Default: `$TMPDIR`
  --no-temp             Output all intermediate files. For development/ debugging purposes
  --all-metadata-to-header
                        Parse all fields from input barcode.csv file and include in the output fasta headers. Be aware spaces in metadata will
                        disrupt the record id, so avoid these.
  --language LANGUAGE   Output report language. Options: English, French. Default: English
  --save-config         Output the config file with all parameters used
  --archive-fastq       Write the supplied fastq_pass directory to the output directory.
  --archivedir ARCHIVEDIR
                        Configure where to put the fastq_pass files, default in the output directory.
Misc options:
  --runname RUNNAME     Run name to appear in report. Default: polioDDNS
  --username USERNAME   Username to appear in report. Default: no user name
  --institute INSTITUTE
                        Institute name to appear in report. Default: no institute name
  --notes NOTES         Miscellaneous notes to appear at top of report. Default: no notes
  --orientation ORIENTATION
                        Orientation of barcodes in wells on a 96-well plate. If `well` is supplied as a column in the barcode.csv, this default
                        orientation will be overwritten. Default: `vertical`. Options: `vertical` or `horizontal`.
  -t THREADS, --threads THREADS
                        Number of threads. Default: 1
  --verbose             Print lots of stuff to screen
  -v, --version         show program's version number and exit
  -h, --help

polio-nanopore / piranha

readme