nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

Proposal: Lowest Common Ancestor (LCA) module for assigning taxonomy #610

Open a4000 opened 1 year ago

a4000 commented 1 year ago

Description of feature

I can add a module to Ampliseq that would run the LCA scripts from eDNAFlow. More information can be found here: https://github.com/mahsa-mousavi/eDNAFlow#lca-lowest-common-ancestor-script-for-assigning-taxonomy

The input to the module would be the output from blastn, so this module might not work if the user doesn't use blastn. The other input file for this module would be the DADA2_table.tsv file, or alternatively, the curated table produced by LULU if the user chose to use LULU.

The main output file for this module is a TSV file that contains the same information as the input ASV TSV file, plus the number of unique BLAST hits and the taxonomy assigned to the ASV at each level (with a "dropped" value at each level where the ASV didn't meet certain thresholds). This output file may need to be modified to be more compatible with downstream steps in Ampliseq.
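To make the idea concrete, here is a minimal Python sketch of an LCA-style assignment. It is not the eDNAFlow script: the function name `lca_assign`, the input dictionary layout, the rank list, and the threshold defaults are all assumptions chosen for illustration. It filters hits by identity and query coverage, keeps hits close to the best one, and reports a rank only when all remaining hits agree on it.

```python
from collections import defaultdict

# Hypothetical rank list; the real script may use different levels.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def lca_assign(hits, min_pident=97.0, min_qcov=100.0, max_diff=1.0):
    """Assign taxonomy per ASV from BLAST-like hits (illustrative only).

    hits: iterable of dicts with keys
        asv, sseqid, pident, qcov, lineage (dict mapping rank -> name).
    Threshold defaults are placeholders, not eDNAFlow's defaults.
    """
    by_asv = defaultdict(list)
    for hit in hits:
        # Discard hits below the identity or query-coverage thresholds.
        if hit["pident"] >= min_pident and hit["qcov"] >= min_qcov:
            by_asv[hit["asv"]].append(hit)

    result = {}
    for asv, asv_hits in by_asv.items():
        # Keep only hits within max_diff identity points of the best hit.
        best = max(h["pident"] for h in asv_hits)
        kept = [h for h in asv_hits if best - h["pident"] <= max_diff]

        row = {"unique_hits": len({h["sseqid"] for h in kept})}
        resolved = True
        for rank in RANKS:
            names = {h["lineage"][rank] for h in kept}
            # Once the hits disagree at a rank, all lower ranks are dropped.
            if resolved and len(names) == 1:
                row[rank] = next(iter(names))
            else:
                resolved = False
                row[rank] = "dropped"
        result[asv] = row
    return result
```

In this sketch, an ASV whose hits all fail the thresholds does not appear in the result at all; a real module would need to decide how to report such ASVs in the output table.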

d4straub commented 1 year ago

All taxonomic classification subworkflows (see here) use ASV sequence tables (fasta) after the various filtering steps (see e.g. DADA2). In any case, a subworkflow such as here would be great. I propose to use LCA after blastn (or rather vsearch, as Daniel Lundin proposed) in a separate taxonomy assignment subworkflow. The output of that subworkflow would be a taxonomic classification that is compatible with downstream steps.

a4000 commented 1 year ago

A subworkflow sounds like a good idea, thanks