Open ctb opened 2 years ago
I was thinking about that and was about to write a new issue, but luckily found this :)
So, my thoughts about `sourmash cluster` are to work on `sourmash compare` output directly.
`sourmash compare` has many options that eventually lead to a distance matrix.
Distance calculation
`sourmash compute`.
Clustering
Visualization
Example of hierarchical clustering #915
There are quite a few clustering issues in the issue tracker, and I've put a lot of time in over the years! I'll link in those issues as I find them.
Probably the most important one is this one, https://github.com/sourmash-bio/sourmash/issues/1265, which might be worth reading. tl;dr It's not at all clear to me that there's a big need for principled clustering techniques that use a variety of different distance metrics, and this has more to do with the biology than anything else.
(It's also fairly hard to explain in practical terms what the different cutoffs would mean for the various clustering techniques, whereas I think the greedy approach is straightforward to explain.)
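To make the "greedy approach" concrete, here is a minimal sketch under loud assumptions: plain Python sets stand in for sketches, Jaccard is the similarity, and the function names are hypothetical. It is not how sourmash-uniqify is actually implemented, just one reading of greedy clustering with a single cutoff.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of hashes."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_cluster(sketches, threshold):
    """Greedily group sketches: the first unassigned sketch founds a cluster
    and absorbs every remaining sketch at or above the threshold."""
    remaining = list(sketches)  # names, in input order
    clusters = []
    while remaining:
        founder = remaining.pop(0)
        members = [founder]
        keep = []
        for name in remaining:
            if jaccard(sketches[founder], sketches[name]) >= threshold:
                members.append(name)
            else:
                keep.append(name)
        remaining = keep
        clusters.append(members)
    return clusters


# Toy hash sets standing in for real sketches (hypothetical data).
sketches = {
    "a.fa": {1, 2, 3, 4},
    "b.fa": {1, 2, 3, 5},
    "c.fa": {10, 11, 12},
    "d.fa": {10, 11, 13},
}
clusters = greedy_cluster(sketches, threshold=0.5)
```

The cutoff has a direct reading here: every member of a cluster is at least `threshold` similar to that cluster's founder, which is what makes this approach easy to explain.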
Anyway, the dominant practical use case that keeps on coming up again and again isn't clustering signatures, but clustering genomes (or, really, sequences and/or files). Pretty much every time I've implemented some kind of clustering or cutoff, the next question is "ok, now how can I get the sequences out in those clusters?" Most recently, I've been working on modifying sourmash-uniqify into uniqify-genomes because of https://github.com/spacegraphcats/spacegraphcats/issues/452.
So my thinking at the moment is to implement sourmash cluster independently of sourmash compare, with a focus on the output formats and file manipulations that we need. @bluegenes is implementing ANI and (eventually?) AAI over in https://github.com/sourmash-bio/sourmash/pull/1788, and we already have the various Jaccard and angular similarity metrics implemented. If we want to provide a variety of different clustering approaches in the middle of that, that's fine by me :).
Good visualization options would be great! I am leaning towards having them belong in sourmash cluster rather than sourmash compare, because we could use the subcommand syntax to make them less complicated. sourmash compare is getting kind of complicated for a "single" command, as is sourmash plot :) - but that's ok, they are some of the oldest commands in sourmash, maybe it's time for a revamp!
I'm really excited about having a dedicated `cluster` function!
A couple thoughts:
* Can we enable clustering from pre-calculated distances, e.g. from output of `prefetch`, for example? Or maybe better, can cluster optionally produce output similar to `prefetch` output, with much more information than comes out of `compare`?
This is interesting. We have to think about how to do this well from a UX perspective... could be tricky.
* I've been using `prefetch` for a couple reasons - first I am minimizing computations by comparing a representative genome --> all genomes, rather than all x all (doesn't help us with cluster), but the second/more important being that we get much more information from `prefetch` -- overlapping bp, jaccard, containment, max containment, etc + soon, estimated ANI. I can imagine wanting to compute all of this information once for a set of files/signatures, then try clustering using different values (e.g. max containment, ANI).
Hrm, ok.
There are only a few distance metrics in there - note that plot and compare don't require that things be distance metrics, and clustering by containment or bp overlap also wouldn't require that the measure be a distance metric, but "proper" clustering would.
(The distance metrics are jaccard, max containment, ANI, and cosine similarity.)
Also note that some of these measures (max containment) can only be calculated for scaled sketches, and cos similarity is the only one that can make use of abundance information. I think it's fine to have `sourmash cluster` require scaled sketches, tho. Or, we could allow a variety of input file formats including prefetch, perhaps doing something like the picklist argument format.
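As a rough illustration of "compute once, then cluster on different values", the sketch below reads a pairwise table and clusters it twice using different columns. The column names (`query_name`, `match_name`, `jaccard`, `max_containment`) and the function are assumptions made for this example, not a statement of the real `prefetch` CSV format, and single-linkage grouping is just one simple choice.

```python
import csv
import io


def cluster_from_pairs(rows, metric, threshold):
    """Single-linkage clusters: link two names whenever metric >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for row in rows:
        q, m = find(row["query_name"]), find(row["match_name"])
        if float(row[metric]) >= threshold and q != m:
            parent[m] = q

    groups = {}
    for name in parent:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical pairwise table; the column names are assumptions, not a spec.
PAIRS_CSV = """query_name,match_name,jaccard,max_containment
g1,g2,0.20,0.90
g2,g3,0.10,0.30
g3,g4,0.60,0.95
"""
rows = list(csv.DictReader(io.StringIO(PAIRS_CSV)))
by_jaccard = cluster_from_pairs(rows, "jaccard", 0.5)
by_max_containment = cluster_from_pairs(rows, "max_containment", 0.8)
```

The same rows give different clusters depending on the column chosen, which is the point of keeping the richer per-pair output around.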
* Can we enable easily adding additional files/signatures to the comparisons? If you get an extra genome, you don't need to run all by all again, just this genome x all existing.
A good requirement to keep in mind.
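One way to picture that requirement, as a sketch only: keep the pairwise results in a store keyed by name pairs, so adding a genome costs one comparison per existing genome. The store layout and `add_sketch` are hypothetical, and plain sets stand in for sketches.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of hashes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def add_sketch(similarities, sketches, name, hashes):
    """Compare one new sketch against every stored sketch and record results.

    `similarities` maps frozenset({name1, name2}) -> similarity, so existing
    entries are never recomputed. Returns the number of new comparisons.
    """
    n_new = 0
    for other, other_hashes in sketches.items():
        similarities[frozenset((name, other))] = jaccard(hashes, other_hashes)
        n_new += 1
    sketches[name] = hashes
    return n_new


sketches, similarities = {}, {}
counts = [
    add_sketch(similarities, sketches, "g1", {1, 2, 3}),
    add_sketch(similarities, sketches, "g2", {2, 3, 4}),
    add_sketch(similarities, sketches, "g3", {3, 4, 5}),
]
```

Each addition does exactly as many comparisons as there are stored sketches, rather than rerunning all by all.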
Maybe we're already doing this, but q:
* When we do all by all comparisons in `compare`, are we loading each pair of signatures separately for each directional comparison (e.g. when comparing with `--containment`)? Can we save any time by calculating both directions at once when we have the two relevant files/sigs loaded?
We only load them once, and calculate the direction once, for symmetric measures (e.g. distances).
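For asymmetric measures such as containment, both directions can be derived from a single intersection. A minimal sketch with plain sets in place of real sketches follows; the function and key names are made up for illustration, and this says nothing about how `compare` is implemented internally.

```python
def pairwise_metrics(a, b):
    """Compute symmetric and directional measures from one intersection."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    a_in_b = shared / len(a) if a else 0.0  # containment of a in b
    b_in_a = shared / len(b) if b else 0.0  # containment of b in a
    return {
        "shared_hashes": shared,
        "jaccard": shared / union if union else 0.0,
        "containment_a_in_b": a_in_b,
        "containment_b_in_a": b_in_a,
        "max_containment": max(a_in_b, b_in_a),
    }


metrics = pairwise_metrics({1, 2, 3, 4}, {3, 4})
```

The intersection is the expensive part; once it is known, every measure above is simple arithmetic on three counts.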
hot take, I think the sourmash component of this functionality should focus on calculating and outputting hash overlaps (and related metrics) and then large scale clustering should be Somebody Else's Problem and not something that we implement in sourmash directly.
The `betterplot` plugin supports cluster extraction via dendrogram cuts in the `plot2` command. It's nice!
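For readers unfamiliar with dendrogram cuts, here is a generic SciPy sketch of the idea. It assumes a toy similarity matrix, average linkage, and a cut height of 0.5, all chosen for illustration; it is not the plugin's code and does not reflect its options.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

labels = ["g1", "g2", "g3", "g4"]
# Toy symmetric similarity matrix (hypothetical values).
similarity = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])

# Hierarchical clustering works on distances, so convert first.
distance = 1.0 - similarity
condensed = squareform(distance, checks=False)
tree = linkage(condensed, method="average")

# Cut the dendrogram at distance 0.5: merges above that height are undone.
assignments = fcluster(tree, t=0.5, criterion="distance")

clusters = {}
for label, cluster_id in zip(labels, assignments):
    clusters.setdefault(int(cluster_id), []).append(label)
```

Moving the cut height up or down merges or splits clusters, which is the knob a dendrogram-based extraction exposes.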
cc @bluegenes
working version here, https://hackmd.io/-sFCAl_3T1qSqH_GvV_pvw - add stuff there, or add thoughts in comments below, but please do not edit this header except to update it with the contents of the hackmd.
thoughts on adding sourmash cluster
sourmash-uniqify seems like a good start! let's turn it into sourmash cluster.
thresholding etc.
by default I think we should support ANI/AAI, but it would be nice to support similarity, containment, and max containment too.
output formats
the current sourmash uniqify provides CSV output, and now also signature merge.
It sounds complicated to me (:sigh:), but I think we need to support:
use cases and CLI
it'd be nice to support clustering FASTA files directly as I see that as a big use case.
what about
?
As an alternative we could tell people to calculate their own signatures and just make sure to save the right filename in the signatures.
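If the source filename is saved with each signature, getting files back out of clusters becomes a lookup. The sketch below assumes clusters are lists of signature names and that a name-to-filename mapping is available; the CSV columns and `write_cluster_csv` are hypothetical, not an existing sourmash output format.

```python
import csv
import io


def write_cluster_csv(clusters, filename_for, fp):
    """Write one row per cluster member: cluster id, signature name, source file."""
    writer = csv.writer(fp, lineterminator="\n")
    writer.writerow(["cluster", "name", "filename"])
    for cluster_id, members in enumerate(clusters):
        for name in members:
            writer.writerow([cluster_id, name, filename_for[name]])


# Hypothetical clusters and filename mapping.
clusters = [["g1", "g2"], ["g3"]]
filename_for = {"g1": "g1.fa.gz", "g2": "g2.fa.gz", "g3": "g3.fa.gz"}
out = io.StringIO()
write_cluster_csv(clusters, filename_for, out)
csv_text = out.getvalue()
```

A table like this answers the "how can I get the sequences out in those clusters?" question with ordinary file tools.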