nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

Report shows wrong taxonomic classification stats for QIIME with UNITE #652

Closed d4straub closed 1 year ago

d4straub commented 1 year ago

Description of the bug

When using UNITE fungi with QIIME2 for taxonomic classification, the statistics in the summary report (results/summary_report/summary_report.html) show 100% classification for rank "Kingdom", while all other ranks show 0%.

This is because the UNITE database contains strings such as

k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Aspergillaceae;g__Aspergillus;s__Aspergillus_penicillioides
k__Fungi
k__Fungi;p__Ascomycota

while Greengenes 16S - Version 13_8 produces taxonomic strings such as

k__Bacteria; p__Proteobacteria; c__Betaproteobacteria
k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Gallionellales; f__Gallionellaceae
k__Bacteria; p__Bacteroidetes; c__Flavobacteriia; o__Flavobacteriales; f__Flavobacteriaceae; g__Flavobacterium; s__
k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Comamonadaceae; g__Rhodoferax; s__

and the parsing for the report only takes the Greengenes format into account, see https://github.com/nf-core/ampliseq/blob/4e48b7100302e2576ac1be2ccc7d464253e9d20e/assets/report_template.Rmd#L991-L999
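One visible difference between the two sets of example strings is the separator: the UNITE strings use ";" with no space, while the Greengenes strings use "; ". The sketch below is not the pipeline's code (the report template is R Markdown); it is a hypothetical Python illustration of per-rank statistics that tolerate both separators. The names `split_taxonomy` and `classified_fraction` are made up for this example, and treating empty placeholders such as "s__" as unclassified is an assumption.

```python
import re

RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def split_taxonomy(taxon: str) -> list[str]:
    """Split a taxonomic string on ';' regardless of following whitespace.

    Assumption: empty placeholders such as 's__' count as unclassified
    and are dropped.
    """
    parts = [part.strip() for part in taxon.split(";")]
    return [p for p in parts if p and not re.fullmatch(r"[a-z]__", p)]


def classified_fraction(taxa: list[str]) -> dict[str, float]:
    """Return, per rank, the fraction of strings classified to at least that rank."""
    depths = [len(split_taxonomy(t)) for t in taxa]
    total = len(depths)
    if total == 0:
        return {rank: 0.0 for rank in RANKS}
    return {
        rank: sum(depth > i for depth in depths) / total
        for i, rank in enumerate(RANKS)
    }


# Example strings taken from this issue.
unite = [
    "k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;"
    "f__Aspergillaceae;g__Aspergillus;s__Aspergillus_penicillioides",
    "k__Fungi",
    "k__Fungi;p__Ascomycota",
]
greengenes = [
    "k__Bacteria; p__Proteobacteria; c__Betaproteobacteria",
    "k__Bacteria; p__Bacteroidetes; c__Flavobacteriia; o__Flavobacteriales; "
    "f__Flavobacteriaceae; g__Flavobacterium; s__",
]

if __name__ == "__main__":
    print(classified_fraction(unite))
    print(classified_fraction(greengenes))
```

Splitting only on "; " would leave every UNITE string as a single field, which would be consistent with the 100% "Kingdom" and 0% elsewhere seen in the report.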

Other taxonomic classifications that I did, i.e. DADA2 with UNITE-Fungi, Kraken2, and SINTAX with UNITE-Fungi (see below), were fine.

Command used and terminal output

nextflow run nf-core/ampliseq -r 2.7.0 -profile cfc --FW_primer CTTGGTCATTTAGAGGAAGTAA --RV_primer GCTGCGTTCTTCATCGATGC --input_fasta "ASV_seqs.fasta" --min_len_asv 1 --dada_ref_taxonomy "unite-fungi=9.0" --sintax_ref_taxonomy "unite-fungi=9.0" --kraken2_ref_tax_custom "https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_20231009.tar.gz" --kraken2_assign_taxlevels "D,P,C,O,F,G,S" --qiime_ref_taxonomy "unite-fungi" --outdir reclassification

Relevant files

No response

System information

No response

d4straub commented 1 year ago

Doesn't really fit in here, but the stats of the length filter also seem off: I used --min_len_asv 1 (as above, just to get the distribution figure in the report) and the report says

Filtering omitted all ASVs with length lower than 1 bp.

The number of ASVs was reduced by 27.5 ( 1.51 %), from 1817.5 to 1790 ASVs.

which isn't right, because there were already 1790 ASVs in the input file and no ASV was removed. The figure itself seems to be fine.

d4straub commented 1 year ago

Documentation issues:

d4straub commented 1 year ago

This is in dev now, closing the issue.