sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

sourmash compare matrix plot matplotlib labels too large/overlapping #2587

Open peterjc opened 1 year ago

peterjc commented 1 year ago

Running sourmash plot --pdf --labels example.npy with ~200 signatures gives plots where the labels are too large and therefore overlap.

Looking at https://github.com/sourmash-bio/sourmash/blob/latest/src/sourmash/fig.py, it does not appear to alter the matplotlib default font sizes, but resources like https://stackoverflow.com/questions/3899980/how-to-change-the-font-size-on-a-matplotlib-plot suggest we might reduce the font size and/or increase the image size for larger datasets.

Is this a bug, or would your recommendation be to follow https://sourmash.readthedocs.io/en/latest/plotting-compare.html#Customizing-plots and customise the plot by writing a modified version of the sourmash/fig.py code?
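To make the "reduce the font size and/or increase the image size" idea concrete, here is a minimal sketch. It is not the code in sourmash/fig.py: the function names `figure_layout` and `plot_matrix` and every default value are hypothetical, and the sketch draws only a labelled heatmap, with no dendrogram.

```python
def figure_layout(n_labels, inches_per_label=0.15, min_side=8.0, max_side=40.0,
                  min_font=2.0, max_font=10.0):
    """Return (figure side in inches, label font size in points).

    All defaults are illustrative guesses, not values used by sourmash.
    """
    if n_labels < 1:
        raise ValueError("need at least one label")
    # Grow the figure with the number of labels, within fixed bounds.
    side = min(max(n_labels * inches_per_label, min_side), max_side)
    # Approximate space per label in points (72 points per inch).
    # This is an overestimate, because margins also take up room.
    points_per_label = side * 72.0 / n_labels
    # Leave roughly 20% spacing between adjacent labels, then clamp.
    font = min(max(0.8 * points_per_label, min_font), max_font)
    return side, font


def plot_matrix(matrix, labels, out_path):
    """Draw a labelled heatmap (no dendrogram) sized for len(labels)."""
    # Imported here so figure_layout() works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    side, font = figure_layout(len(labels))
    fig, ax = plt.subplots(figsize=(side, side))
    image = ax.imshow(matrix, vmin=0.0, vmax=1.0, aspect="equal")
    ticks = list(range(len(labels)))
    ax.set_xticks(ticks)
    ax.set_xticklabels(labels, rotation=90, fontsize=font)
    ax.set_yticks(ticks)
    ax.set_yticklabels(labels, fontsize=font)
    fig.colorbar(image, ax=ax, shrink=0.6)
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
```

With numpy and matplotlib installed, something like `plot_matrix(numpy.load("example.npy"), labels, "example.pdf")` should work, where `labels` is read from the labels text file that `sourmash compare` writes next to the matrix (I am assuming it is named `example.npy.labels.txt`). For ~200 signatures the sketch picks a 30 inch figure and a font of about 8.6 pt.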

ctb commented 1 year ago

sourmash plot could certainly use some love! It was one of the first things we implemented ~6 years ago, and (FBFW) has driven a lot of our citations... but we haven't upgraded it, ever. This was due to some combination of:

This is all me saying that it's never risen to the level of "gotta fix" but has definitely risen to the level of "hmmmm yeah we should really be doing something about that."

A few related thoughts and issues -

the R package, sourmashconsumr

sourmashconsumr https://github.com/sourmash-bio/sourmash/issues/2492 is an R package that has some nice viz:

[Screenshot, 2023-04-24]

sourmash plot isn't doing the right thing, I think

per https://github.com/sourmash-bio/sourmash/issues/2406, I appear to have mixed up my similarity and distance matrices.
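For readers unfamiliar with the distinction, here is a generic sketch of why it matters for clustering. It does not reproduce what sourmash plot does or what that issue concluded. The toy matrix is invented, and `1 - similarity` is only one simple way to turn a similarity into a distance.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy similarity matrix (1.0 = identical); invented values, not sourmash output.
labels = ["A", "B", "C", "D"]
similarity = np.array([
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])

# Hierarchical clustering works on distances, so convert first.
distance = 1.0 - similarity
np.fill_diagonal(distance, 0.0)  # guard against floating-point residue

# linkage() takes the condensed (upper-triangle) form of a distance matrix.
condensed = squareform(distance, checks=True)
tree = linkage(condensed, method="average")

# Cut the tree into two flat clusters and report membership.
clusters = fcluster(tree, t=2, criterion="maxclust")
for name, cluster_id in zip(labels, clusters):
    print(name, cluster_id)
```

One pitfall worth knowing: if `linkage()` is handed a square 2-D array, SciPy treats each row as an observation vector and not as a row of precomputed distances, so passing the condensed form explicitly avoids that ambiguity.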

better label handling, plot annotation, etc

per https://github.com/sourmash-bio/sourmash/issues/2452, there are some good opportunities to make editing label names better (since I intuit that is a lot of what people want to do)

per https://github.com/sourmash-bio/sourmash/issues/2583 there are lots of opportunities to annotate dendrograms with more information

plugins are now a thing

per https://github.com/sourmash-bio/sourmash/issues/1353 and https://github.com/sourmash-bio/sourmash/pull/2438 in particular it would now be straightforward to experiment with other clustering and viz techniques all from within the relative safety of the sourmash command line.

this would permit the addition of dependencies that we don't want to add to core sourmash (for size and/or platform/install and/or support reasons) to support better output viz.


this is all to say... we just need someone who cares, or at least pointers to some good plots from other packages that we can steal ;). I know this is an active area, I just don't have a starting point!

peterjc commented 1 year ago

That all makes sense. One size fits all visualisation defaults are not easy.

ctb commented 1 year ago

additional thoughts -

ctb commented 1 year ago

more from slack:

Christopher Gulvik: Fig 1c minimum spanning tree style in GrapeTree rocks, by [@jcarrico] and [@happykhan]. I've grown to appreciate it more and more for a broader audience than hierclust or phytrees to show outbreak or cluster data (SNPs, ANI, or cgMLST). The software that currently makes that style here has end of life this year.

[Screenshot, 2023-04-26]

ctb commented 1 month ago

The betterplot plugin would be a good place to add custom plotting code for very large plots.