Closed: joelb123 closed this 4 years ago
usearch has two limitations that make it unsuitable for clustering against large sets. The first is licensing, which restricts unpaid usage to the 32-bit version. Large NR sets won't fit within the 32-bit version's memory limit. The second is that all the work has to be done on the fly after the set is defined. That means no precalculation on the large NR set, and consequently long run times even if one is properly licensed.
vsearch does not support clustering in protein space, which makes it nearly useless for the max-distance clustering we need.
MinHash clustering (via mash) looks promising. It too is a k-mer based method at heart, like usearch, and it permits pre-calculation on one or more input data sets.
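To make the precalculation point concrete, here is a minimal sketch of the MinHash idea on protein k-mers. It is not mash's implementation: the function names (`kmers`, `sketch`, `jaccard_estimate`), the hash function (BLAKE2b), and the values of `k` and `size` are all assumptions chosen for illustration. The point is only that sketches of the large reference set can be computed once and reused for every later query.

```python
import hashlib


def kmers(seq, k=5):
    """Return the set of overlapping k-mers in a protein sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer):
    """Map a k-mer to a 64-bit integer (illustrative hash choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, size=64):
    """Bottom-s sketch: the `size` smallest distinct k-mer hashes, sorted."""
    return sorted({_hash(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=64):
    """Estimate the Jaccard index of two k-mer sets from their sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both inputs.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    # Hypothetical reference set: sketched once, reusable across runs.
    reference = {
        "ref1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "ref2": "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS",
    }
    ref_sketches = {name: sketch(seq) for name, seq in reference.items()}

    # A later query only needs its own sketch plus cheap comparisons.
    query = sketch("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA")
    for name, ref in ref_sketches.items():
        print(name, jaccard_estimate(query, ref))
```

The comparison step touches only the fixed-size sketches, so its cost does not depend on sequence length. Whether this resolution is adequate for distant protein homologs is exactly what would need testing.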
Not sure that mash supports proteins. Do you know? I may be thinking of the wrong k-mer-based clustering tool based on MinHash, though.
Looks like it does provide support for amino acid alphabets, but I think we should discuss further before proceeding. At the least, running some simple tests before coding anything to accommodate this functionality seems to be in order.
Deferring this item until we are doing supertrees. MMseqs2 looks promising for this purpose.
This is a science issue in which software plays only a minor role, mainly by making run times workable.
Inspection of a sample of homology-singleton genes shows that many of them look like crap and do not align to anything in various non-redundant protein sets (except perhaps themselves). Some singletons/orphans are better left behind on the "bone pile" since they will not be phylogenetically connected with anything else in biology, much less a set of gene families.
At the same time, knowing which genes have long-distance relationships makes the calculated families a bit better. It may also highlight some called genes as possible contaminants (or possible horizontal transfers) from other species.
Consider whether the clustering step can feasibly accommodate a non-redundant set spanning all of protein space. UniRef50 comes to mind.
If clustering with a large NR set is feasible, then there are other interesting gene sets (e.g., the PDB) to consider including from the start.