pacific-2020 / pacific

PACIFIC: A lightweight alignment-free deep-learning classifier of SARS-CoV-2 and co-infecting viral sequences
MIT License
5 stars 2 forks source link

picture

PACIFIC: A lightweight deep-learning classifier of SARS-CoV-2 and co-infecting viral sequences

Read more about PACIFIC here:

Acera Mateos P., Balboa R.F., Easteal S., Eyras E., and Patel, H.R. PACIFIC: A lightweight deep-learning classifier of SARS-CoV-2 and co-infecting RNA viruses. Scientific Reports, 2021.

PACIFIC implements deep learning to classify RNA sequencing reads into human, SARS-CoV-2 or additional respiratory viruses. PACIFIC takes an input FASTA/FASTQ file and predicts the presence of the following viruses and their relative proportions within a sample:

Table of Contents

  1. Quick start
  2. System requirements
  3. Usage
  4. Input
  5. Output
  6. Test and model data

Quick start

Install and test PACIFIC

Note: As the model file is ~300MB, we hosted this file on Cloudstor. The model file can be found here. We recommend that users download this file and place it in the model directory as in the code below:

##Create required virtual environment for running PACIFIC
conda create --name pacific python=3.7.6

##Activate the environment
conda activate pacific

##Install required dependencies
conda install numpy=1.18.1 tensorflow=2.2.0 keras=2.3.1 pandas=1.0.1 scikit-learn=0.21.3 biopython=1.76

#Clone PACIFIC repository
git clone https://github.com/pacific-2020/pacific.git

#Download model file
cd pacific
wget -O model/pacific.01.pacific_9mers_nonGPU.h5 https://cloudstor.aarnet.edu.au/plus/s/Hwg20YRlua9a2OH/download

#Change to test directory and run PACIFIC
cd test
python ../PACIFIC.py \
  -i test.fa.gz \
  -m ../model/pacific.01.pacific_9mers_nonGPU.h5 \
  -t ../model/tokenizer.01.pacific_9mers.pickle \
  -l ../model/label_maker.01.pacific_9mers.pickle \ 
  -f fasta

If installed correctly, PACIFIC should generate pacificoutput_test.fa.gz and test.fa.gz_summary.txt in the test directory, and should provide the following results in the terminal:

From a total of 5000 reads, 0 were discarded (e.g. non-ACGT nucleotides/characters or short reads (<150bp))

   filename            class  # predicted reads  predicted reads (%)  # predicted reads above 0.95  predicted reads above 0.95 (%)
 test.fa.gz       SARS-CoV-2                  0                 0.00                             0                            0.00
 test.fa.gz    Coronaviridae               4997                99.94                          4997                           99.94
 test.fa.gz        Influenza                  0                 0.00                             0                            0.00
 test.fa.gz  Metapneumovirus                  0                 0.00                             0                            0.00
 test.fa.gz       Rhinovirus                  0                 0.00                             0                            0.00
 test.fa.gz            Human                  1                 0.02                             0                            0.00
 test.fa.gz        Discarded                  0                 0.00                             0                            0.00
 test.fa.gz     rc_discarded                  2                 0.04                             1                            0.02

Thank you for using PACIFIC =^)

System requirements

Usage

Run PACIFIC

usage: python PACIFIC.py [options] -i <in.fa>|<in.fq> -m <model> -t <tokenizer> -l <label-maker>

Required arguments:

  -i, --input_file  FASTA/FASTQ input file path
  -m, --model       PACIFIC model file path
  -t, --tokenizer   Tokenizer file path
  -l, --label_maker Label maker object file path

Optional arguments:

  -h, --help            show this help message and exit
  -f <fasta/fastq>, --file_type <fasta/fastq>
                        FASTA or FASTQ training file format [fastq]
  -o <dir>, --outputdir <dir>
                        Path to output directory [.]
  -d <dir>, --tmpdir <dir>
                        Path to tmp directory [outputdir]
  -T <float>, --prediction_threshold <float>
                        Threshold/cutoff for predictions [0.95]
  -c <int>, --chunk_size <int>
                        Number of reads per chunk [100000]
  -v, --version         show program's version number and exit

Input

PACIFIC expects four arguments as input:

PACIFIC allows users to use their own custom training model, tokenizer and label maker files. However, we recommend the use of default parameters and the following files above as input into the program.

Output

PACIFIC will output the following files (using default parameters):

1. pacificoutput_$input.gz: A gzipped fasta file with modified sequence headers from the input fasta file. PACIFIC includes the prediction class and score in the header, as well as whether the sequences are discarded when run through the program e.g. non-ACGT nucleotides/characters or short reads (<150bp) - "discarded" or whether a read fails to meet the reverse complement prediction test "rc_discarded").

For example, the following describes a sequence predicted to be of the Coronaviridae class with a prediction score of 0.97:

>fastaheader:0.97:Coronaviridae

2. $input_summary.txt: A text file which summarises the number of predicted reads for each class and their predicted proportions (%) both in the entire dataset, as well as in predicted reads with a score above 0.95. The table also provides information on discarded and rc_discarded reads. This is the same information that is provided as output in the terminal when PACIFIC is run.

Test and model data

  1. Model and test data are available here
  2. PACIFIC model can be downloaded from here