torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

DNA base characters other than ACGT #134

Closed kemin711 closed 5 years ago

kemin711 commented 5 years ago

Error: Illegal character 'N' in sequence on line 28062

This is not an issue, but it could be handled by the algorithm: if we encounter an N, we could count the base as different. I am not sure how much work that would need. The user can filter out sequences with N, but this will remove some data.

frederic-mahe commented 5 years ago

Hi @kemin711, swarm only accepts sequences with unambiguous nucleotides (ACGT). This is a design choice that allows swarm to be very fast; supporting ambiguous nucleotides would mean a major slowdown.

With Illumina sequencing, ambiguous nucleotides are rare (very few sequences contain Ns), and swarm's strict ACGT-only support has never been a problem for me. If for some reason you end up with sequences with Ns, you can filter those sequences out with vsearch, or you can instead just remove or replace the Ns:

sed '/^>/ ! s/[Nn]//g' input.fasta | swarm
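
If you would rather drop whole sequences than edit them (the filtering route mentioned above), here is a minimal sketch in plain awk, as an alternative to a vsearch filter. The function name `drop_ambiguous` and the file name `input.fasta` are placeholders chosen for illustration, not part of swarm. It assumes a standard FASTA layout (header lines starting with `>`), treats any character other than ACGT (upper or lower case) as ambiguous, and writes each kept sequence on a single line.

```shell
# Sketch only: drop every FASTA record whose sequence contains a character
# other than A, C, G or T (case-insensitive). Sequences wrapped over several
# lines are joined before the check and printed on one line.
drop_ambiguous() {
    awk '
        function emit() {
            # Print the buffered record only if it has a header and a
            # non-empty sequence made exclusively of ACGT characters
            if (header != "" && seq != "" && seq !~ /[^ACGTacgt]/) {
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }  # new record starts
        { seq = seq $0 }                               # accumulate sequence
        END { emit() }                                 # last record
    '
}

# Hypothetical usage (input.fasta is a placeholder):
# drop_ambiguous < input.fasta | swarm
```

Unlike the sed command, this leaves the remaining sequences untouched, at the cost of discarding the records that contain Ns.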