torognes / vsearch

Versatile open-source tool for microbiome analysis

Implement the search_global command #132

Open torognes opened 8 years ago

torognes commented 8 years ago

Should be simple to add. Together with the other issues that suggest further usearch 7 commands to implement, this would complete the set of usearch 7 commands for nucleotide sequences, as far as I can see, except for the cluster_otus command.

kunstner commented 8 years ago

Do you have any plans to implement the cluster_otus command as well? This is the only command I still use from the usearch pipeline. All other commands are already replaced by vsearch.

torognes commented 8 years ago

Thanks for the suggestion. We do not have any plans to implement cluster_otus. I'll look into it and see how much work it would take.

colinbrislawn commented 8 years ago

:+1:

Of course specifics are scarce. My understanding was that it performs like --cluster_smallmem, except that when a read does not match an existing centroid, it is passed through --uparse_ref before it is allowed to become a new centroid. The --uparse_ref algorithm attempts to explain how a new read could derive from existing reads in a database, in a way that sounds a lot like --uchime_denovo. Implementing @frederic-mahe's uchime suggestions as an internal step of OTU picking could come close to parity with --cluster_otus. https://github.com/torognes/vsearch/issues/118#issuecomment-193178967

http://www.drive5.com/usearch/manual/uparseotu_algo.html http://www.drive5.com/usearch/manual/cmd_cluster_otus.html http://www.drive5.com/usearch/manual/uparseref_algo.html
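The idea described above (greedy centroid clustering with a chimera check before a read may found a new centroid) can be sketched roughly as follows. This is only an illustration of that description, not the usearch or vsearch implementation: the function names, the position-wise `identity` measure, the equal-length assumption, and the two-parent crossover test are all hypothetical simplifications.

```python
from itertools import permutations


def identity(a, b):
    """Fraction of matching positions (assumes equal-length, pre-aligned reads)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def looks_chimeric(read, centroids, min_identity):
    """True if a two-parent crossover of existing centroids explains the read."""
    for left, right in permutations(centroids, 2):
        for cut in range(1, len(read)):
            model = left[:cut] + right[cut:]  # hypothetical chimera model
            if identity(read, model) >= min_identity:
                return True
    return False


def cluster_with_chimera_filter(reads, threshold=0.97):
    """Greedy clustering; reads are assumed sorted by decreasing abundance."""
    centroids, members, discarded = [], {}, []
    for read in reads:
        best = max(centroids, key=lambda c: identity(read, c), default=None)
        if best is not None and identity(read, best) >= threshold:
            members[best].append(read)      # joins an existing OTU
        elif looks_chimeric(read, centroids, threshold):
            discarded.append(read)          # explained by existing centroids
        else:
            centroids.append(read)          # allowed to found a new OTU
            members[read] = [read]
    return centroids, members, discarded
```

In this sketch a read that is half one centroid and half another is discarded instead of inflating the OTU count, which is the behaviour the comment attributes to the uparse approach.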

@kunstner, if I may ask, why choose uparse over another clustering algorithm? What qualities would you hope for in a vsearch implementation?

kunstner commented 8 years ago

@colinbrislawn: I use it for microbiome data. Actually, I have a quite smoothly running pipeline using vsearch/usearch for preprocessing and mothur for classification and OTU binning. Unfortunately, the running time of this approach is quite long, and it is very demanding with respect to either RAM or disk space. I was looking for an alternative approach and came across the cluster_otus command. The results look quite similar to the results I obtained with mothur (which isn't the case if I use the other cluster commands implemented in usearch or vsearch). My second aim is to use a pipeline based completely on open-source software that scales nicely with huge data sets. Mothur is not a good option in this case if I have to test different parameters.

colinbrislawn commented 8 years ago

Hi @kunstner, thanks for telling me a little more about your pipeline. I used the uparse pipeline right when it came out, but like you I value open-source science, so I switched to vsearch in 2015. VSEARCH definitely scales well and swarm scales even better, although its definition of OTU is a little esoteric and has not yet garnered the popularity it deserves.

I have not used mothur for clustering. Does it mitigate OTU inflation really well like uparse does?

Colin

kunstner commented 8 years ago

Hi @colinbrislawn,

my personal experience is that mothur usually mitigates OTU inflation well. But I have some data sets (with lots of samples sequenced) with quite a lot of very rare OTUs which I did not get using uparse. For most of the data I don't see this problem. Unfortunately, mothur has another problem: for larger data sets (MiSeq data, >400 samples), it is difficult to obtain the representative sequence for each OTU if a distance-based method is applied.

Axel

frederic-mahe commented 5 years ago

For future reference, usearch_global stands for fast database search, and search_global stands for slow database search (no heuristics, can detect arbitrarily low pairwise identities).
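To make the distinction concrete, here is a minimal sketch of the two strategies. It assumes that the fast search prefilters targets by shared k-mers before aligning, and that the slow search aligns the query against every target. The function names, the edit-distance-based identity, and the parameters `k` and `max_candidates` are hypothetical and do not correspond to vsearch options.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (stand-in for a global alignment)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def identity(a, b):
    """Hypothetical identity measure: 1 - edit distance / longer length."""
    return 1 - edit_distance(a, b) / max(len(a), len(b))


def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def exhaustive_search(query, db):
    """Slow search: align the query against every target, skipping nothing."""
    return max(db, key=lambda name: identity(query, db[name]))


def heuristic_search(query, db, k=4, max_candidates=1):
    """Fast search: align only the targets sharing the most k-mers with the query."""
    q = kmers(query, k)
    ranked = sorted(db, key=lambda name: len(q & kmers(db[name], k)), reverse=True)
    candidates = [n for n in ranked[:max_candidates] if q & kmers(db[n], k)]
    return max(candidates, key=lambda name: identity(query, db[name]), default=None)
```

With a query that shares no k-mer with any target, the heuristic version in this sketch reports no hit, while the exhaustive version still returns the best (low-identity) target. That is the trade-off the comment describes.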