gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License

effect of dereplication method #46

Open slambrechts opened 9 months ago

slambrechts commented 9 months ago

Hi,

In the manual I read:

The reference database can be dereplicated using one of three methods (parameter: '--method') in the 'dereplicate' module:

- strict: only unique sequences will be retained, irrespective of taxonomy
- single_species: for each species in the database, a single sequence is retained
- uniq_species: for each species in the database, all unique sequences are retained

Can this have a big effect on taxonomic identifications? Did you compare results between the three methods? I am trying to build a 16S database for insects, and I am unsure which of these three methods to choose. Is there one you would recommend in this case, or is this trial and error?

gjeunen commented 9 months ago

Hello @slambrechts,

It is recommended to use --method uniq_species for dereplicating the data to ensure no species information is lost.
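
To make the difference between the three methods concrete, here is a minimal Python sketch on toy data. It is not the tool's actual implementation: the `Record` type, the function names, and the example sequences are all hypothetical, and the choice to keep the first sequence encountered (for example in `single_species`) is an assumption, since the manual text quoted above does not say which sequence is retained.

```python
from typing import Callable, Hashable, List, NamedTuple


class Record(NamedTuple):
    """Hypothetical reference entry: an ID, a species label, and a sequence."""
    seq_id: str
    species: str
    sequence: str


def dereplicate(records: List[Record], key: Callable[[Record], Hashable]) -> List[Record]:
    """Keep the first record seen for each distinct key (assumption: first wins)."""
    seen = set()
    kept = []
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            kept.append(rec)
    return kept


def strict(records: List[Record]) -> List[Record]:
    # Unique sequences only, irrespective of taxonomy.
    return dereplicate(records, lambda r: r.sequence)


def single_species(records: List[Record]) -> List[Record]:
    # One sequence per species.
    return dereplicate(records, lambda r: r.species)


def uniq_species(records: List[Record]) -> List[Record]:
    # All unique sequences within each species.
    return dereplicate(records, lambda r: (r.species, r.sequence))


# Toy data: two species share an identical sequence, and one species has two variants.
RECORDS = [
    Record("s1", "Apis mellifera", "ACGT"),
    Record("s2", "Apis mellifera", "ACGT"),
    Record("s3", "Apis mellifera", "ACGA"),
    Record("s4", "Apis cerana", "ACGT"),
    Record("s5", "Bombus terrestris", "TTGA"),
]

for method in (strict, single_species, uniq_species):
    kept = method(RECORDS)
    species = sorted({r.species for r in kept})
    print(f"{method.__name__}: {len(kept)} sequences, species kept: {species}")
```

In this toy example, the `strict` variant drops Apis cerana entirely because its only sequence is identical to one from Apis mellifera, and the `single_species` variant drops the second Apis mellifera variant. Only the `uniq_species` variant keeps every species together with all of its distinct sequences.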

Best, Gert-Jan