ThomasFort closed this issue 9 years ago
Hi, the short answer is no.
It would require a major rewrite of swarm, as it was designed and optimized to work with only A, C, G, T (or U). My experience with high-throughput sequencing data is that ambiguous nucleotides are extremely rare, and can be discarded without any noticeable impact on downstream analyses. I am surprised you have a lot of sequences with ambiguous nucleotides. Are you working with Sanger sequences?
OK, thank you for your answer. No, I'm working on Illumina data. I don't have a large number of those ambiguities, and as qiime, usearch and vsearch support them, I didn't even realize that some were present in my sequences.
I will try to find all the ambiguities. If they are rare, I will get rid of the associated sequences and try swarm again. Otherwise I will use the clustering algorithm implemented in vsearch.
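To judge whether the ambiguities are rare, one option is to count the affected entries first. The following is only a sketch, not part of swarm or vsearch: the function name `count_ambiguous` is made up for illustration, and it assumes upper-case sequences, Unix line endings, and that any character other than A, C, G or T counts as ambiguous (so U would be counted too).

```shell
# Hypothetical helper: report how many FASTA entries contain a character
# other than A, C, G or T. Sequences may span several lines.
count_ambiguous() {
    awk '
        # Close the current entry and update the counters.
        function flush() { if (seen) { total++; if (bad) ambiguous++ } }
        /^>/      { flush(); seen = 1; bad = 0; next }   # header: start a new entry
        /[^ACGT]/ { bad = 1 }                            # sequence line with a non-ACGT symbol
        END       { flush()
                    printf "%d of %d sequences contain ambiguous nucleotides\n", ambiguous, total }
    ' "$@"
}
```

It can be run as `count_ambiguous in.fas`, or at the end of a pipe.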
Thank you again for your answer! Have a nice day!
Here is a command to drop sequences containing ambiguous nucleotides:
awk '{if (/^>/) {a = $0} else {if (/^[ACGT]*$/) {printf "%s\n%s\n", a, $0}}}' in.fas > out.fas
This assumes that each fasta entry is on two lines (one line for the header, one line for the sequence). It is reasonably fast (approximately 3 million sequences per minute on my computer). I hope it will help you prepare your dataset to run swarm.
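If the sequences are wrapped over several lines, the two-line assumption no longer holds. A possible variant, given here as a sketch rather than a tested replacement, joins the lines of each entry before checking it. The function name `filter_unambiguous` is hypothetical, and unlike the command above it also drops entries with an empty sequence and writes every kept sequence on a single line.

```shell
# Hypothetical helper: keep only FASTA entries made exclusively of A, C, G, T.
# Multi-line sequences are joined, so the output has two lines per entry.
filter_unambiguous() {
    awk '
        # Print the buffered entry if its sequence is non-empty and unambiguous.
        function flush() {
            if (header != "" && seq ~ /^[ACGT]+$/) printf "%s\n%s\n", header, seq
        }
        /^>/ { flush(); header = $0; seq = ""; next }   # new header: emit the previous entry
             { seq = seq $0 }                           # sequence line: append to the buffer
        END  { flush() }                                # do not forget the last entry
    ' "$@"
}
```

Usage mirrors the original command: `filter_unambiguous in.fas > out.fas`.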
Thank you, it's working very well!
Hello, many of the sequences I analyse are not composed only of A, C, G, T or U. Where there is a nucleotide ambiguity, I can also find R, Y, S, W, K, M, D, B, H, V and N, and swarm returns an error. Are you going to improve swarm to take those ambiguities into account? (I don't even know if it's possible.) Thanks!