torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Why is U replaced by T when outputting the seeds #93

Closed a1an77 closed 7 years ago

a1an77 commented 7 years ago

I just noticed that every U is replaced by T in the seeds output. Even though U and T are equivalent for clustering purposes, why not keep the original format?

colinbrislawn commented 7 years ago

I think this is related to the growth phase of swarm, which uses an exact hashing algorithm to recruit variants that differ from the centroid by d=1. While an alignment algorithm could use a scoring matrix to treat U and T as equal, the hashing employed by swarm works better with a single symbol, T, for the database, queries, and output centroids.

frederic-mahe commented 7 years ago

Each U is converted to a T internally when the sequences are loaded into memory. Keeping track of whether or not some sequences contain "U"s would require a major rewrite of swarm's core. Could you please give us some details? Do you have a mix of sequences, some using U and some using T?
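To answer that question for a given input file, a quick check along these lines may help. This is only a sketch: `fasta_ut_report` is a hypothetical helper, not part of swarm, and it assumes a plain FASTA file in which header lines start with ">".

```shell
# Hypothetical helper (not part of swarm): report whether the sequence
# lines of a FASTA file contain U, T, or both. Header lines (starting
# with ">") are skipped, so a U or T in a sequence name is not counted.
fasta_ut_report() {
    awk '!/^>/ {
             if ($0 ~ /[Uu]/) u = 1   # at least one U in a sequence line
             if ($0 ~ /[Tt]/) t = 1   # at least one T in a sequence line
         }
         END {
             if (u && t)  print "mixed"
             else if (u)  print "U only"
             else if (t)  print "T only"
             else         print "neither"
         }' "$1"
}
```

Called as `fasta_ut_report input.fasta` (the file name is a placeholder), it prints a single word or phrase describing the file.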

If all your sequences have "U"s, then it is trivial to convert swarm's fasta output back, leaving the header lines untouched:

sed -i '/^>/! y/T/U/' rep.fasta
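If you would rather not edit swarm's output in place, a non-destructive variant could look like the sketch below. It assumes that every input sequence used U, so that each T in the seeds stands for an original U. `restore_u` is a hypothetical helper name, and both upper and lower case are handled only as a precaution.

```shell
# Hypothetical helper (not part of swarm): print a FASTA file with T
# turned back into U on sequence lines only. Header lines (starting
# with ">") are left untouched, and the input file is not modified.
restore_u() {
    sed '/^>/!y/Tt/Uu/' "$1"
}
```

For example, `restore_u rep.fasta > rep_rna.fasta` would write the converted seeds to a new file (`rep_rna.fasta` is a placeholder name). Avoiding `-i` also sidesteps the difference between GNU and BSD sed in how that option takes its argument.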