jhuapl-bio / taxtriage

TaxTriage is a Nextflow workflow designed to agnostically identify and classify microbial organisms within short- or long-read metagenomic NGS data. This flexible tool was developed with various use-cases of mNGS in mind.
MIT License

Add Centrifuge and metaPhlan2 #27

Open Merritt-Brian opened 1 year ago

Merritt-Brian commented 1 year ago

Description of feature

Adding 2 classifier approaches

Centrifuge and metaPhlan2

Merritt-Brian commented 1 year ago

metaPhlan does not generate classified/unclassified fastq files as an optional output. Use either awk or bioconda (python script) to keep only the reads that align to the hits. You'll need to import the classifier outfile from metaPhlan to figure out which reads were classified and which weren't.
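
A minimal sketch of the python route, for discussion only. It assumes the classifier outfile is a tab-separated text file whose first column is the read ID (one line per read that hit), and that the fastq is uncompressed with four lines per record. The function names (`load_classified_ids`, `split_fastq`) are made up for this example and are not part of metaPhlan or TaxTriage.

```python
def load_classified_ids(mapping_handle):
    """Collect read IDs from a tab-separated file (assumed: read ID in column 1)."""
    ids = set()
    for line in mapping_handle:
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        ids.add(line.split("\t")[0])
    return ids


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record fastq."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) stops iteration
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def split_fastq(fastq_handle, classified_ids, classified_out, unclassified_out):
    """Write each record to one of two outputs depending on whether its ID hit."""
    counts = {"classified": 0, "unclassified": 0}
    for header, seq, plus, qual in read_fastq(fastq_handle):
        # Read ID = header without the leading '@', up to the first whitespace.
        parts = header[1:].split()
        read_id = parts[0] if parts else ""
        if read_id in classified_ids:
            out, key = classified_out, "classified"
        else:
            out, key = unclassified_out, "unclassified"
        out.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
        counts[key] += 1
    return counts
```

To use it, open the outfile and the fastq with `open(...)`, pass the handles in, and write to two output handles. If the read IDs in the outfile carry suffixes such as `/1` that the fastq headers lack (or the reverse), they would need to be normalized before the lookup.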

erinyoung commented 10 months ago

It would be great to have a classifier that's built for nanopore reads (or other long-read sequencing methods) like spumoni. Kraken2 can have issues with nanopore's higher error rates.

aretchless commented 2 months ago

Hi Brian. I used to love Metaphlan2, but unfortunately I don't find it very useful anymore. My recollection is that there was no reliable way to update the reference database and the developers did not have regular releases (hence SARS-CoV-2 may be missing). Metaphlan4 dropped viruses (but can still use the Metaphlan3 algorithm)... but overall it just looked too hard to maintain (even though I love the strategy they use and the high specificity).

On this topic, I've found Diamond2 to be really helpful when dealing with novel viruses... though its species-level assignments tend to be very noisy. https://github.com/bbuchfink/diamond

Merritt-Brian commented 2 months ago

I agree on the approach with Diamond as an alternative for more novel discoveries. It seems like most nucleotide classifiers (be they metagenomics-based or standard aligners) perform worse than aa-based alignment. Some preliminary testing with a novel sequence (~55% identity to a viral genome) showed a dramatic improvement in detection F1 scores. We will optimize and introduce it into the pipeline (likely as a mandatory step) next month @aretchless