IUPAC nucleotide ambiguity

torognes / swarm

A robust and fast clustering method for amplicon-based studies

GNU Affero General Public License v3.0


IUPAC nucleotide ambiguity #54

ThomasFort closed this issue 9 years ago

ThomasFort commented 9 years ago

Hello, many of the sequences I analyse are not composed only of A, C, G, T or U. Where there is a nucleotide ambiguity, I can also find R, Y, S, W, K, M, D, B, H, V and N, and swarm returns an error. Are you going to improve swarm to take those ambiguities into account? (I don't even know if it's possible.) Thanks!

frederic-mahe commented 9 years ago

Hi, the short answer is no.

It would require a major rewrite of swarm, as it was designed and optimized to work with only A, C, G, T (or U). My experience with high-throughput sequencing data is that ambiguous nucleotides are extremely rare, and can be discarded without any noticeable impact on downstream analyses. I am surprised you have a lot of sequences with ambiguous nucleotides. Are you working with Sanger sequences?

ThomasFort commented 9 years ago

OK, thank you for your answer. No, I'm working on Illumina data. I don't have a large number of those ambiguities, and as qiime, usearch and vsearch support them, I didn't even realize that some were present in my sequences.

I will try to find every ambiguity. If they are rare, I will get rid of the associated sequences and try swarm again. Otherwise I will use the clustering algorithm implemented in vsearch.

Thank you again for your answer! Have a nice day!
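One way to check how rare the ambiguities are is to count the affected entries before deciding. The sketch below is not from the thread: count_ambiguous is a hypothetical helper name, and it assumes a FASTA input whose sequences may be wrapped over several lines. It treats anything other than A, C, G, T or U (in either case) as ambiguous, and prints the number of affected entries followed by the total number of entries.

```shell
# Hypothetical helper (not part of swarm): count FASTA entries that contain
# any symbol other than A, C, G, T or U (upper or lower case).
# Prints "<ambiguous entries> <total entries>".
count_ambiguous() {
    awk '
        # Evaluate the entry collected so far, if there is one.
        function tally() {
            if (header == "") return
            total++
            if (seq !~ /^[ACGTUacgtu]*$/) ambiguous++
        }
        /^>/ { tally(); header = $0; seq = ""; next }  # new header: close the previous entry
        { seq = seq $0 }                               # join wrapped sequence lines
        END { tally(); printf "%d %d\n", ambiguous, total }
    ' "$@"
}
```

For example, count_ambiguous in.fas reports both counts for in.fas. The function also reads standard input when no file is given.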

frederic-mahe commented 9 years ago

Here is a command to drop sequences containing ambiguous nucleotides:

awk '{if (/^>/) {a = $0} else {if (/^[ACGT]*$/) {printf "%s\n%s\n", a, $0}}}' in.fas > out.fas

assuming your fasta entries are on two lines (one line for the header, one line for the sequence). It is reasonably fast (approx. 3 million sequences per minute on my computer). I hope it will help you prepare your dataset to run swarm.
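The command above keeps only sequences made of uppercase A, C, G and T, so entries containing U or lowercase letters are dropped as well, and it relies on the two-line layout. For files where sequences are wrapped over several lines, a variant could look like the sketch below. This is an illustration, not the author's command: drop_ambiguous is a hypothetical helper name, and accepting U and lowercase letters is an assumption about what the downstream tool tolerates.

```shell
# Hypothetical helper (not part of swarm): keep only FASTA entries whose
# sequence consists solely of A, C, G, T or U (upper or lower case).
# Wrapped sequences are joined and written on a single line.
drop_ambiguous() {
    awk '
        # Print the entry collected so far if its sequence is unambiguous.
        function emit() {
            if (header != "" && seq ~ /^[ACGTUacgtu]+$/) {
                printf "%s\n%s\n", header, seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }  # new header: close the previous entry
        { seq = seq $0 }                              # join wrapped sequence lines
        END { emit() }
    ' "$@"
}
```

It is used in the same way as the original command: drop_ambiguous in.fas > out.fas. Entries with an empty sequence are dropped too, since the pattern requires at least one nucleotide.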

ThomasFort commented 9 years ago

Thank you, it's working very well!