nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

Strategy for PR2 species assignments #593

Closed (andand closed this issue 1 year ago)

andand commented 1 year ago

Description of feature

According to this issue in PR2 regarding species assignment, sequences whose annotation ends with "_sp." may actually belong to properly named species of the same genus (the data provider may simply not have identified them at species level). If these are included when running assignSpecies, one may therefore get ASVs that seem to match multiple species although they in fact match only one. It may thus be a good idea to remove reference sequences whose annotation ends with "_sp." before running assignSpecies.
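To make the problem concrete, here is a toy sketch of exact matching against a reference set. It is not DADA2's implementation of assignSpecies; the helper `species_hits`, the reference names, and the sequences are all made up for illustration.

```python
def species_hits(asv, references):
    """Return the sorted species names whose reference sequence contains the ASV exactly.

    Hypothetical helper: it only mimics the idea of exact matching.
    """
    return sorted({name for name, seq in references if asv in seq})


# Invented reference entries: (species annotation, sequence).
references = [
    ("Chlorella vulgaris", "TTACGTACGTAA"),
    ("Chlorella sp.", "GGACGTACGTCC"),
    ("Micromonas pusilla", "TTGGCCAATTGG"),
]
asv = "ACGTACGT"

# With the "sp." entry present, the ASV appears to match two species.
with_sp = species_hits(asv, references)

# After dropping entries ending in " sp.", only the named species remains.
named_only = [(name, seq) for name, seq in references if not name.endswith(" sp.")]
without_sp = species_hits(asv, named_only)

print(with_sp)     # ['Chlorella sp.', 'Chlorella vulgaris']
print(without_sp)  # ['Chlorella vulgaris']
```

If the "sp." entry really is Chlorella vulgaris that was never identified to species level, the first result is a spurious ambiguity and the second is the answer one would want.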

d4straub commented 1 year ago

Thanks for that info. I see annotations ending with "_sp." in PR2 v5.0.0 and v4.14.0. I could modify the assignSpecies input, for any PR2 version, so that it excludes all sequences whose annotation ends with "_sp.". That would affect all versions. Let me know if you disagree.

Edit: that could be done by adding | awk '!/ sp.\n/' RS=">" ORS=">" (which removes sequences whose names end with sp.) to bin/taxref_reformat_pr2.sh.
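For anyone who wants to try the filtering outside the pipeline, the same idea can be sketched in Python. This is only an illustration, not the code in bin/taxref_reformat_pr2.sh: the function name `drop_unnamed_species`, the header layout (">ID Genus species"), and the example records are assumptions.

```python
def drop_unnamed_species(fasta_text, suffixes=(" sp.", "_sp.")):
    """Remove FASTA records whose header line ends with one of the given suffixes.

    Hypothetical helper; assumes the species annotation is the last
    field of the header line.
    """
    kept_lines = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Decide once per record, based on its header line.
            keep = not line.rstrip().endswith(suffixes)
        if keep:
            kept_lines.append(line)
    return "\n".join(kept_lines) + ("\n" if kept_lines else "")


# Invented example records in an assumed ">ID Genus species" layout.
example = (
    ">AB1 Chlorella vulgaris\nACGT\n"
    ">AB2 Chlorella sp.\nACGA\n"
    ">AB3 Micromonas pusilla\nTTGA\n"
)
filtered = drop_unnamed_species(example)
print(filtered, end="")
```

The sketch compares the header suffix literally, so only headers that really end in "sp." are dropped, and sequence lines are carried along with whichever header precedes them.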

andand commented 1 year ago

Sounds good!

d4straub commented 1 year ago

@jtangrot Do you agree with removing all annotations ending with sp. from the assignSpecies input? I am asking because it seems valid to me, but I am not really into taxonomic databases and would welcome another opinion.

jtangrot commented 1 year ago

I agree, but it should be noted that I work closely with Anders (andand), so my opinion is a bit biased in this case...

d4straub commented 1 year ago

Thanks, I see :)

d4straub commented 1 year ago

Would any of you like to review #599? It's just what we discussed here, a tiny change.

d4straub commented 1 year ago

Merged, will be in next release!