omics-lab / VirusTaxo


VirusTaxo: Taxonomic classification of viruses from metagenomic contigs

VirusTaxo has an average accuracy of 93% at genus level across DNA and RNA viruses.

1. Running VirusTaxo

Requirements

Installation

2. Predict virus taxonomy from a FASTA file using a prebuilt database

# Download db files
gdown "https://drive.google.com/uc?id=1gz0n5oHomWjpT0HXsrqh8hTLqmqiqgJs"

# Extract db files
tar -xvzf vt_db_jan21_2024.tar.gz

db file                 Molecule   Usage
DNA_RNA_18451_k20.pkl   DNA & RNA  Recommended for samples containing both DNA & RNA viruses
DNA_9384_k21.pkl        DNA        Recommended for samples containing DNA viruses only
RNA_9067_k17.pkl        RNA        Recommended for samples containing RNA viruses only

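The database file names appear to encode the molecule type, the number of sequences, and the k-mer length (the counts 9384 and 9067 match the version history, and 18451 is their sum). A minimal sketch of reading those fields from a file name, assuming that naming pattern holds; `parse_db_name` is a hypothetical helper and not part of VirusTaxo:

```python
import re

# Assumed pattern: <MOLECULE>_<sequence count>_k<k-mer length>.pkl
# This is inferred from the three file names listed above, not documented by VirusTaxo.
DB_NAME = re.compile(r"^(?P<molecule>[A-Z_]+)_(?P<sequences>\d+)_k(?P<k>\d+)\.pkl$")


def parse_db_name(filename: str) -> dict:
    """Split a database file name into molecule, sequence count, and k."""
    match = DB_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected database file name: {filename}")
    return {
        "molecule": match.group("molecule"),
        "sequences": int(match.group("sequences")),
        "k": int(match.group("k")),
    }


for name in ("DNA_RNA_18451_k20.pkl", "DNA_9384_k21.pkl", "RNA_9067_k17.pkl"):
    print(name, parse_db_name(name))
```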
python3 predict.py -h

usage: predict.py [-h] --model_path MODEL_PATH --seq SEQ [--output_csv OUTPUT_CSV] [--entropy ENTROPY] [--enrichment_score ENRICHMENT_SCORE]

options:
  -h, --help            show this help message and exit
  --model_path MODEL_PATH
                        Absolute or relative path of pre-built model
  --seq SEQ             Absolute or relative path of fasta sequence file
  --output_csv OUTPUT_CSV
                        Path to save the output CSV file (default: VirusTaxo_taxonomy_output.csv)
  --entropy ENTROPY     Entropy threshold for classification (default: 0.5)
  --enrichment_score ENRICHMENT_SCORE
                        Enrichment score threshold for classification (default: 0.8)
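
The options listed in the help text can be mirrored with a small `argparse` parser, which is handy for checking defaults when wrapping `predict.py` in another script. This is an illustrative sketch reconstructed from the help output only; the actual parser in `predict.py` may differ, and treating the two thresholds as floats is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the documented predict.py options (illustrative, not the real source)."""
    parser = argparse.ArgumentParser(prog="predict.py")
    parser.add_argument("--model_path", required=True,
                        help="Absolute or relative path of pre-built model")
    parser.add_argument("--seq", required=True,
                        help="Absolute or relative path of fasta sequence file")
    parser.add_argument("--output_csv", default="VirusTaxo_taxonomy_output.csv",
                        help="Path to save the output CSV file")
    # Assumption: thresholds are parsed as floats.
    parser.add_argument("--entropy", type=float, default=0.5,
                        help="Entropy threshold for classification")
    parser.add_argument("--enrichment_score", type=float, default=0.8,
                        help="Enrichment score threshold for classification")
    return parser


# Only the two required options are given, so the documented defaults apply.
args = build_parser().parse_args(
    ["--model_path", "DNA_RNA_18451_k20.pkl", "--seq", "contig.fasta"]
)
print(args.output_csv, args.entropy, args.enrichment_score)
```
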
# --model_path: database file; --seq: query fasta file
python3 predict.py \
   --model_path /path/DNA_RNA_18451_k20.pkl \
   --seq ./Dataset/contig.fasta

Id              Length  Genus           Entropy Enrichment_Score
QuerySeq-1      219     Unclassified    1.000   0.000
QuerySeq-2      720     Betacoronavirus 0.000   0.973
QuerySeq-3      1540    Unknown         0.285   0.820
QuerySeq-4      1330    Lentivirus      0.000   0.987
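
Once the predictions are saved, the contigs that received a genus can be separated from those labelled `Unclassified` or `Unknown`. The sketch below assumes the output CSV is comma-separated with the column names shown above as its header row; the sample rows are copied from that example output and `assigned_genera` is a hypothetical helper.

```python
import csv
import io

# Sample rows copied from the example output; the comma-separated layout is an assumption.
SAMPLE_CSV = """Id,Length,Genus,Entropy,Enrichment_Score
QuerySeq-1,219,Unclassified,1.000,0.000
QuerySeq-2,720,Betacoronavirus,0.000,0.973
QuerySeq-3,1540,Unknown,0.285,0.820
QuerySeq-4,1330,Lentivirus,0.000,0.987
"""

UNASSIGNED = {"Unclassified", "Unknown"}


def assigned_genera(handle) -> dict:
    """Map sequence Id to Genus for rows that were assigned a genus."""
    return {
        row["Id"]: row["Genus"]
        for row in csv.DictReader(handle)
        if row["Genus"] not in UNASSIGNED
    }


result = assigned_genera(io.StringIO(SAMPLE_CSV))
print(result)
```

To read a real result file, pass an open file handle instead, for example `open("VirusTaxo_taxonomy_output.csv", newline="")`.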

3. Interpretation of output

4. Build your custom database

# --meta: provide your metadata file; --seq: provide your fasta file
python3 build.py \
   --meta ./Dataset/RNA_meta.csv \
   --seq ./Dataset/RNA_seq.fasta \
   --k 17 \
   --saving_path /path/RNA.pkl
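
The `--k` option sets the k-mer length, that is, the length of the overlapping substrings taken from each sequence. As a generic illustration of what a k-mer is, and not the code used by `build.py`, a sequence can be split into k-mers and counted as follows; `count_kmers` is a hypothetical helper.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    seq = sequence.upper()
    # A sequence of length n yields n - k + 1 k-mers (none if n < k).
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = count_kmers("ATGCATGCA", k=4)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```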

5. Method limitation and interpretation

6. Version history

Script               Version           Date        Sequences            Download
v1 Genus prediction  database.v2_2024  Jan21_2024  DNA=9384 & RNA=9067  here
v1 Genus prediction  database.v1_2022  Apr27_2022  DNA=4421 & RNA=2529  here
Used in manuscript   database.v1_2022  Apr27_2022  DNA=4421 & RNA=2529  here

7. Contact

Rashedul Islam, PhD (rashedul.gen@gmail.com)

8. Citation

Rajan Saha Raju, Abdullah Al Nahid, Preonath Chondrow Dev, Rashedul Islam. VirusTaxo: Taxonomic classification of viruses from the genome sequence using k-mer enrichment. Genomics, Volume 114, Issue 4, July 2022.