torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
123 stars 23 forks source link

swarm terminates if any sequence in the input fasta file has an N #180

Closed killidude closed 9 months ago

killidude commented 9 months ago

I am not sure if this is a bug or not, but some of the behavior probably is unintended.

Hi Torbjørn,

When I run swarm and even a single sequence in the input fasta file has an N, swarm terminates with: Reading sequences: 0%
Error: Illegal character 'N' in sequence on line xxx.

I understand why it is easier to deal with sequences without ambiguities, but in principle sequences with ambiguous bases can be collapsed or clustered in many instances. Do you plan to implement this feature?

If not, is it possible for swarm to just skip over a sequence that contains an N?

I realize that I can filter my merged sequences in vsearch to remove all sequences with an N, but it would be nicer if swarm dealt with Ns automatically.

Thank you,

Tomas

frederic-mahe commented 9 months ago

Hi @killidude

early in the design of swarm we decided to terminate if a sequence contains an N. In high-throughput sequencing data, sequences with Ns are rare, and a common practice is to discard them. Still having sequences with Ns at the clustering step could be the sign that something went wrong in earlier steps, and requires a user decision. See for example these past issues: #54 and #58 (#134 and #138 are also related).

If your sequences frequently contain Ns, and if you want to keep these sequences in your analyses, I suggest to use swarm's capacity to read streams of data, and to replace Ns with another nucleotide on-the-fly:

printf ">s_1\nN\n" | sed '/^>/ ! s/[Nn]/A/g' | swarm

(dev note: this topic was already covered by several tests, but I've added two specific tests for the record: https://github.com/frederic-mahe/swarm-tests/commit/7c53ea9454f244c6ebd22e3a3325e04c36dd29a3)

killidude commented 9 months ago

Hi @frederic-mahe,

Thanks for the answer. I have very few Ns in my sequence data, none in MiSeq, but a fraction of a percent in HiSeq and a bit more in NovaSeq. So from a practical point of view, the information in these sequences is irrelevant. But your suggestion on how to use sequences with Ns to add number of reads to a cluster as per #58 would minimize information loss and potentially rescue some low read clusters. And that would be interesting.

Cheers,

Tomas