ctb commented 2 years ago

cc @bluegenes

working version here, https://hackmd.io/-sFCAl_3T1qSqH_GvV_pvw - add stuff there, or add thoughts in comments below, but please do not edit this header except to update it with the contents of the hackmd.

thoughts on adding sourmash cluster

sourmash-uniqify seems like a good start! let's turn it into sourmash cluster.

thresholding etc.

by default I think we should support ANI/AAI, but it would be nice to support similarity, containment, and max containment too.
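
For reference, a minimal sketch of the three hash-based measures on plain Python sets. The function names are hypothetical and this is not sourmash API; it only spells out the definitions so the thresholding discussion has something concrete to point at.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def containment(a: set, b: set) -> float:
    """Fraction of A's hashes that are also in B: |A ∩ B| / |A|."""
    return len(a & b) / len(a) if a else 0.0


def max_containment(a: set, b: set) -> float:
    """Larger of the two directional containments: |A ∩ B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Toy hash sets standing in for two sketches (made-up values).
a = {1, 2, 3, 4}
b = {3, 4, 5, 6, 7, 8}

print(jaccard(a, b), containment(a, b), containment(b, a), max_containment(a, b))
```

Note that containment is directional (`containment(a, b)` and `containment(b, a)` differ here), while Jaccard and max containment are symmetric.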

output formats

the current sourmash-uniqify provides CSV output, and now also signature merge.

It sounds complicated to me (:sigh:), but I think we need to support:

* some kind of CSV output
* copying/saving of signatures into clusters
* merging of signatures

use cases and CLI

it'd be nice to support clustering FASTA files directly as I see that as a big use case.

what about

sourmash cluster signatures|sigs
sourmash cluster genomes|files

?

* in the case of genomes, we would support using pre-existing signatures ($genome.sig), or calculating them (potentially only in memory? --no-save-sigs?)
* support --protein, --dna, --dayhoff, --hp and also parameter strings -p
* by default, could do it with ANI.

As an alternative we could tell people to calculate their own signatures and just make sure to save the right filename in the signatures.

mr-eyes commented 2 years ago

I was thinking about that and was about to write a new issue, but luckily found this :)

So, my thinking is that sourmash cluster should work on sourmash compare output directly.

sourmash compare has many options that eventually lead to a distance matrix.
The distance matrix is, I think, the only thing we need for the clustering.

Distance calculation

This should go in sourmash compute.
The distance matrix can be produced using [Jaccard, containment, max_containment, ANI, shared_kmers, etc ...]
I liked the way they listed multiple distance estimation methods in Simka https://peerj.com/articles/cs-94/#table-1

Clustering

Clustering can be done in many flavors, like graph clustering, hierarchical clustering, etc.
Each clustering technique will produce different output. For example, graph clustering will produce connected components, while hierarchical clustering will produce a tree.
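
As a rough illustration of the graph-clustering flavor, here is a minimal sketch that thresholds a similarity matrix and returns connected components with a small union-find. The matrix values, the threshold, and the function names are all made up for the example; nothing here is existing sourmash code.

```python
def connected_components(n, edges):
    """Group node indices 0..n-1 into connected components (union-find)."""
    parent = list(range(n))

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def cluster_by_threshold(sim, threshold):
    """Connect every pair with similarity >= threshold, then take components."""
    n = len(sim)
    edges = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if sim[i][j] >= threshold
    ]
    return connected_components(n, edges)


# Made-up symmetric similarity matrix for four samples.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
]
clusters = cluster_by_threshold(sim, 0.5)
print(clusters)
```

The output is a flat list of groups. A hierarchical method would instead return a tree that still has to be cut somewhere to get clusters.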

Visualization

It would be nice to add some visualization (hopefully interactive), like graphs, trees, dendrograms, etc.

Example of hierarchical clustering #915

ctb commented 2 years ago

There are quite a few clustering issues in the issue tracker, and I've put a lot of time in over the years! I'll link in those issues as I find them.

Probably the most important one is this one, https://github.com/sourmash-bio/sourmash/issues/1265, which might be worth reading. tl;dr It's not at all clear to me that there's a big need for principled clustering techniques that use a variety of different distance metrics, and this has more to do with the biology than anything else.

(It's also fairly hard to explain in practical terms what the different cutoffs would mean for the various clustering techniques, whereas I think the greedy approach is straightforward to explain.)
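
To make "greedy" concrete, here is a minimal sketch of one possible reading: walk through the sketches in order, attach each one to the first existing representative it matches at the threshold, and otherwise start a new cluster. This is an assumption for illustration, not necessarily what sourmash-uniqify does, and the names and data are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two hash sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def greedy_cluster(sketches, similarity, threshold):
    """Greedy clustering: the first member of each cluster is its representative.

    `sketches` maps a name to a set of hashes. Returns a list of
    (representative_name, [member_names]) tuples.
    """
    clusters = []
    for name, hashes in sketches.items():
        for rep_name, members in clusters:
            if similarity(hashes, sketches[rep_name]) >= threshold:
                members.append(name)
                break
        else:
            # No representative was close enough: start a new cluster.
            clusters.append((name, [name]))
    return clusters


# Toy hash sets (made-up values).
sketches = {
    "g1": {1, 2, 3, 4},
    "g2": {1, 2, 3, 5},
    "g3": {10, 11, 12},
}
result = greedy_cluster(sketches, jaccard, threshold=0.5)
print(result)
```

The cutoff has a direct meaning here: every member is at least `threshold` similar to its cluster's representative. The result does depend on input order.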

Anyway, the dominant practical use case that keeps coming up isn't clustering signatures, but clustering genomes (or, really, sequences and/or files). Pretty much every time I've implemented some kind of clustering or cutoff, the next question is "ok, now how can I get the sequences out in those clusters?" Most recently, I've been working on modifying sourmash-uniqify into uniqify-genomes because of https://github.com/spacegraphcats/spacegraphcats/issues/452.

So my thinking at the moment is to implement sourmash cluster independently of sourmash compare, with a focus on the output formats and file manipulations that we need. @bluegenes is implementing ANI and (eventually?) AAI over in https://github.com/sourmash-bio/sourmash/pull/1788, and we already have the various Jaccard and angular similarity metrics implemented. If we want to provide a variety of different clustering approaches in the middle of that, that's fine by me :).

Good visualization options would be great! I am leaning towards having them belong in sourmash cluster rather than sourmash compare, because we could use the subcommand syntax to make them less complicated. sourmash compare is getting kind of complicated for a "single" command, as is sourmash plot :) - but that's ok, they are some of the oldest commands in sourmash, maybe it's time for a revamp!

bluegenes commented 2 years ago

I'm really excited about having a dedicated cluster function!

A couple thoughts:

- Can we enable clustering from pre-calculated distances, e.g. from output of prefetch, for example? Or maybe better, can cluster optionally produce output similar to prefetch output, with much more information than comes out of compare?
  - I've been using prefetch for a couple reasons - first I am minimizing computations by comparing a representative genome --> all genomes, rather than all x all (doesn't help us with cluster), but the second/more important being that we get much more information from prefetch -- overlapping bp, jaccard, containment, max containment, etc + soon, estimated ANI. I can imagine wanting to compute all of this information once for a set of files/signatures, then try clustering using different values (e.g. max containment, ANI).
  - Can we enable easily adding additional files/signatures to the comparisons? If you get an extra genome, you don't need to run all by all again, just this genome x all existing.
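
A minimal sketch of the "compute once, add incrementally" idea, on plain Python sets. The storage layout (a dict of shared-hash counts keyed by unordered name pairs) and the function names are hypothetical, not anything sourmash provides.

```python
def add_sketch(name, hashes, sketches, overlaps):
    """Compare one new sketch against all existing ones, then store it.

    Only new-vs-existing comparisons are made; earlier pairs are untouched.
    """
    for other, other_hashes in sketches.items():
        overlaps[frozenset((name, other))] = len(hashes & other_hashes)
    sketches[name] = hashes


def jaccard_from_overlap(a, b, sketches, overlaps):
    """Derive Jaccard later from the stored overlap, without re-intersecting."""
    shared = overlaps[frozenset((a, b))]
    union = len(sketches[a]) + len(sketches[b]) - shared
    return shared / union if union else 0.0


sketches = {}   # name -> set of hashes
overlaps = {}   # frozenset({name_a, name_b}) -> number of shared hashes

# Toy hash sets (made-up values), added one at a time.
add_sketch("g1", {1, 2, 3, 4}, sketches, overlaps)
add_sketch("g2", {3, 4, 5, 6}, sketches, overlaps)
add_sketch("g3", {4, 6, 7}, sketches, overlaps)  # only g3 x {g1, g2} is computed

print(len(overlaps), jaccard_from_overlap("g1", "g2", sketches, overlaps))
```

Storing the raw overlap plus the sketch sizes is enough to derive Jaccard, containment, and max containment afterwards, so the clustering measure can be changed without redoing the comparisons.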

Maybe we're already doing this, but q:

- When we do all by all comparisons in compare, are we loading each pair of signatures separately for each directional comparison (e.g. when comparing with --containment)? Can we save any time by calculating both directions at once when we have the two relevant files/sigs loaded?

ctb commented 2 years ago

> I'm really excited about having a dedicated cluster function!
>
> A couple thoughts:
>
> * Can we enable clustering from pre-calculated distances, e.g. from output of `prefetch`, for example? Or maybe better, can cluster optionally produce output similar to `prefetch` output, with much more information than comes out of `compare`?

This is interesting. We have to think about how to do this well from a UX perspective... could be tricky.

>   * I've been using `prefetch` for a couple reasons - first I am minimizing computations by comparing a representative genome --> all genomes, rather than all x all (doesn't help us with cluster), but the second/more important being that we get much more information from `prefetch` -- overlapping bp, jaccard, containment, max containment, etc + soon, estimated ANI. I can imagine wanting to compute all of this information once for a set of files/signatures, then try clustering using different values (e.g. max containment, ANI).

Hrm, ok.

There are only a few distance metrics in there. Note that plot and compare don't require that things be distance metrics, and clustering by containment or bp overlap also wouldn't require that the measure be a distance metric, but "proper" clustering would.

(The distance metrics are jaccard, max containment, ANI, and cosine similarity.)

Also note that some of these measures (max containment) can only be calculated for scaled sketches, and cosine similarity is the only one that can make use of abundance information. I think it's fine to have sourmash cluster require scaled sketches, though. Or, we could allow a variety of input file formats including prefetch, perhaps doing something like the picklist argument format.

>   * Can we enable easily adding additional files/signatures to the comparisons? If you get an extra genome, you don't need to run all by all again, just this genome x all existing.

A good requirement to keep in mind.

> Maybe we're already doing this, but q:
>
> * When we do all by all comparisons in `compare`, are we loading each pair of signatures separately for each directional comparison (e.g. when comparing with `--containment`)? Can we save any time by calculating both directions at once when we have the two relevant files/sigs loaded?

We only load them once, and calculate the direction once, for symmetric measures (e.g. distances).
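
For an asymmetric measure like containment, the idea in the question can be sketched as follows: visit each unordered pair once, intersect once, and fill in both directions. This is an illustration on plain Python sets with hypothetical names, not a description of how `compare` is implemented.

```python
from itertools import combinations


def containment_matrix(sketches):
    """All-by-all containment, computing both directions per pair at once.

    matrix[i][j] is the fraction of sketch i's hashes found in sketch j.
    """
    names = list(sketches)
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    for a, b in combinations(names, 2):
        # One intersection serves both directions.
        shared = len(sketches[a] & sketches[b])
        i, j = index[a], index[b]
        matrix[i][j] = shared / len(sketches[a]) if sketches[a] else 0.0
        matrix[j][i] = shared / len(sketches[b]) if sketches[b] else 0.0
    return names, matrix


# Toy hash sets (made-up values): g2 is fully contained in g1.
names, matrix = containment_matrix({"g1": {1, 2, 3, 4}, "g2": {3, 4}})
print(names, matrix)
```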

ctb commented 2 years ago

hot take, I think the sourmash component of this functionality should focus on calculating and outputting hash overlaps (and related metrics) and then large scale clustering should be Somebody Else's Problem and not something that we implement in sourmash directly.

ctb commented 6 months ago

The betterplot plugin supports cluster extraction via dendrogram cuts in the plot2 command. It's nice!

sourmash-bio / sourmash

adding sourmash cluster - some specific thoughts #1814
