nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

Does the `gtdb` database only include Bacteria? #708

Closed (erikrikarddaniel closed this issue 8 months ago)

erikrikarddaniel commented 8 months ago

Description of the bug

A colleague ran Ampliseq on a dataset generated with archaeon-specific primers from an environment with lots of Archaea. Using `gtdb` as the database, she got no Archaea, whereas with both `silva` and `sbdi-gtdb` she got almost only Archaea. Could it be that we only download the bac120* 16S sequences when we build the database?

Command used and terminal output

No response

Relevant files

No response

System information

No response

d4straub commented 8 months ago

I had a quick look: version 2.8.0 has both bac & arc files in the config to download: https://github.com/nf-core/ampliseq/blob/f3c97e1b9088b229d4bcdeb2f9a25f21d6552f8b/conf/ref_databases.config#L30 These files should be picked up by https://github.com/nf-core/ampliseq/blob/f3c97e1b9088b229d4bcdeb2f9a25f21d6552f8b/bin/taxref_reformat_gtdb.sh#L12-L16. Whether there is a bug could be investigated further, ofc.

erikrikarddaniel commented 8 months ago

OK, I see no problem with that code. We'll run some tests to see if we can replicate the problem. If we can, maybe we need to warn users not to use this database for Archaea (or not at all)?
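One quick way to test this is to count how many reference sequences per domain ended up in the formatted taxonomy file. A minimal sketch, assuming DADA2-style FASTA headers where the first semicolon-separated field is the domain (e.g. `>Archaea;...`); the function name `count_domains` and the example headers are made up for illustration and are not part of the pipeline:

```python
from collections import Counter


def count_domains(fasta_lines):
    """Count sequences per domain in a taxonomy reference FASTA.

    Assumption: headers look like '>Domain;Phylum;Class;...', so the
    first semicolon-separated field after '>' is the domain.
    """
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            # Drop the leading '>' and keep only the first taxonomic rank.
            domain = line[1:].strip().split(";", 1)[0]
            counts[domain] += 1
    return counts


# Illustrative headers only, not real database content.
example = [
    ">Bacteria;Pseudomonadota;Gammaproteobacteria;",
    "ACGT",
    ">Archaea;Thermoproteota;Nitrososphaeria;",
    "ACGT",
    ">Bacteria;Bacillota;Bacilli;",
    "ACGT",
]
print(count_domains(example))
```

For a real file, the open file handle can be passed in place of `example`. If `Archaea` is missing from the result, the archaeal sequences never made it into the database.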

d4straub commented 8 months ago

So, as far as I understood, that was indeed a bug? For all versions, or only for the most recent one? Might be good to mention in the upcoming release notes...

erikrikarddaniel commented 8 months ago

Yes, a bug. From r207 and onwards. (GTDB changed the number of markers in r207.)
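To illustrate how a change in marker count can silently drop a whole domain, here is a hedged sketch. It assumes, hypothetically, that a reformatting step selects files by a name prefix that includes the marker count. The file names, the patterns, the `select` helper, and the specific prefix change (`ar122` to `ar53`) are assumptions for illustration, not taken from the pipeline code:

```python
from fnmatch import fnmatch

# Hypothetical file names; the number in the prefix is the marker count.
older_release = ["bac120_ssu_reps.fna", "ar122_ssu_reps.fna"]
newer_release = ["bac120_ssu_reps.fna", "ar53_ssu_reps.fna"]


def select(files, patterns):
    """Return the files that match at least one shell-style pattern."""
    return [f for f in files if any(fnmatch(f, p) for p in patterns)]


# Patterns that hard-code the marker count vs. patterns that do not.
fixed_patterns = ["bac120*", "ar122*"]
flexible_patterns = ["bac*", "ar*"]

print(select(older_release, fixed_patterns))     # both files match
print(select(newer_release, fixed_patterns))     # archaeal file is missed
print(select(newer_release, flexible_patterns))  # both files match again
```

Under these assumptions, nothing fails loudly: the bacterial file still matches, so the database builds without error but contains no Archaea.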

d4straub commented 8 months ago

Thanks, that info is in the changelog now.