sheffield-bioinformatics-core / periscope

A tool to quantify sub-genomic RNA (sgRNA) expression in SARS-CoV-2 artic network amplicon nanopore sequencing data.
GNU General Public License v3.0
16 stars 5 forks source link

alt text

periscope

A tool to quantify sub-genomic RNA (sgRNA) expression in SARS-CoV-2 artic network amplicon sequencing data. Initial classification of reads into sub-genomic or not based on https://www.biorxiv.org/content/10.1101/2020.04.10.029454v1.abstract

Citing

Please cite our pre-print if you use this tool in any publications:

https://www.biorxiv.org/content/10.1101/2020.07.01.181867v1

Requirements

periscope runs on MacOS and Linux. We have also confirmed the tool runs under windows 10 unix subsystem.

Installation

git clone https://github.com/sheffield-bioinformatics-core/periscope.git && cd periscope
conda env create -f environment.yml
conda activate periscope
pip install .

Execution

conda activate periscope

periscope \
    --fastq-dir <PATH_TO_DEMUXED_FASTQ> \ (ont only)
    OR
    --fastq <FULL_PATH_OF_FASTQ_FILE(s)> \ (space separated list of fastq files, you MUST use this for Illumina data)
    --output-prefix <PREFIX> \
    --sample <SAMPLE_NAME> \
    --artic-primers <ASSAY_VERSION; V1,V2,V3,V4,2kb,midnight> \
    --resources <PATH_TO_PERISCOPE_RESOURCES_FOLDER> \
    --technology <SEQUECNING TECH; ont or illumina> \
    --threads <THREADS> 

For custom primers use --artic-primers argument followed by:

To view the requirements for these files and advice on how to generate them, go to Custom amplicons and primers section.

output-prefix will be the directory and start of the filename for the output.

So if you put ./SAMPLE1 for this argument outputs will go in the current working directory prefixed by "SAMPLE1".

Note - for illumina data please use --fastq .fastq.gz .fastq.gz and --technology illumina

/tmp Issues

If you have issues with tmp this is because pybedtools writes there. v0.0.3 contains a fix, and you can also specify --tmp and redirect this somewhere else

Pipeline overview

alt text Figure 1. Workflow Overview

Pre-Processing

Counting

This step takes roughly 1minute per 10k reads Our median read count is ~250k and this will take around 25minutes

alt text Figure 2. Read Classification Algorithm

Normalisation

ONT Data

We have taken two approaches, a global normalisation based on mapped read counts or a local normalisation based on gRNA from the same amplicon.

Illumina Data

Ilumina data is still a work in progress, as of v0.0.8 you can get raw sgRNA counts and counts normalised to the average coverage around the ORF TRS start site.

It is worth noting this follows a slightly different algorithm, relying instead on soft clipping. The ratoinale here is that illumina data is more accurate therefore we can detect shorter matches to the leader.

Outputs:

.fastq

A merge of all files in the fastq directory specified as input.

_periscope_counts.csv

The counts of genomic, sub-genomic and normalisation values for known ORFs

_periscope_amplicons.csv

The amplicon by amplicon counts, this file is useful to see where the counts come from. Multiple amplicons may be represented more than once where they may have contributed to more than one ORF.

_periscope_novel_counts.csv

The counts of genomic, sub-genomic and normalisation values for non-canonical ORFs

.bam

minmap2 mapped reads and index with no adjustments made.

_periscope.bam

This is the original input bam file and index created by periscope with the reads specified in the fastq-dir. This file, however, has tags which represent the results of periscope:

These are useful for manual review in IGV or similar genome viewer. You can sort or colour reads by these tags to aid in manual review and figure creation.

Extracting Base Frequencies

To examine the composition of bases at variant sites we have provided this code.

conda activate periscope

gunzip <ARTIC_NETWORK_VCF>.pass.vcf.gz

<PATH_TO_PERISCOPE>/periscope/periscope/scripts/variant_expression.py \
    --periscope-bam <PATH_TO_PERISCOPE_OUTPUT_BAM> \
    --vcf <ARTIC_NETWORK_VCF>.pass.vcf \
    --sample <SAMPLE_NAME> \
    --output-prefix <PREFIX>

_base_counts.csv

Counts of each base at each position

_base_counts.png

Plot of each position and base composition

Running Tests

We provide a sam file for testing the main module of periscope.

reads.sam contains 23 reads which have been manually reviewed for the truth

cd <INSTALLATION_PATH>/periscope/tests

pytest test_search_for_sgRNA.py 

Custom Amplicons and Primers

Custom Amplicons File

Each line must be an amplicon entry with 4, tab-delimited, features:

Example amplicons.bed file:

MN908947.3  30  410 nCoV-2019_1
MN908947.3  320 726 nCoV-2019_2
MN908947.3  642 1028    nCoV-2019_3
MN908947.3  943 1337    nCoV-2019_4
MN908947.3  1242    1651    nCoV-2019_5
MN908947.3  1573    1964    nCoV-2019_6

Custom Primers File

Each line must be a primer entry with 5, tab-delimited, features:

Example primers.bed file:

MN908947.3  30  54  nCoV-2019_1_LEFT    nCoV-2019_1
MN908947.3  385 410 nCoV-2019_1_RIGHT   nCoV-2019_1
MN908947.3  320 342 nCoV-2019_2_LEFT    nCoV-2019_2
MN908947.3  704 726 nCoV-2019_2_RIGHT   nCoV-2019_2
MN908947.3  642 664 nCoV-2019_3_LEFT    nCoV-2019_1
MN908947.3  1004    1028    nCoV-2019_3_RIGHT   nCoV-2019_1

Citations

Long Amplicon Tiling Data

We implemened 2kb amplicon tilling in v0.0.2 from:

SARS-CoV-2 genomes recovered by long amplicon tiling multiplex approach using nanopore sequencing and applicable to other sequencing platforms Paola Cristina Resende, Fernando Couto Motta, Sunando Roy, Luciana Appolinario, Allison Fabri, Joilson Xavier, Kathryn Harris, Aline Rocha Matos, Braulia Caetano, Maria Orgeswalska, Milene Miranda, Cristiana Garcia, André Abreu, Rachel Williams, Judith Breuer, Marilda M Siqueira bioRxiv 2020.04.30.069039; doi: https://doi.org/10.1101/2020.04.30.069039

Why periscope? SUB-genomic RNA, SUB-marine, periscope.