sourmash is a tool for biological sequence analysis and comparisons.

betterplot is a sourmash plugin that provides improved plotting/viz
and cluster examination for sourmash-based sketch comparisons. It
includes better similarity matrix plotting, MDS plots, and
clustermaps, as well as support for coloring samples based on
categories. It also includes support for sparse comparison output
formats produced by the fast multithreaded manysearch and pairwise
functions in the branchwater plugin for sourmash.

sourmash compare and sourmash plot produce basic distance matrix plots
that are useful for comparing and visualizing the relationships between
dozens to hundreds of genomes. This is one of the most popular use cases
for sourmash!

However, the visualization can be improved well beyond the basic viz
that sourmash plot produces, and there are many only slightly more
complicated use cases for comparing, clustering, and visualizing many
genomes. This plugin exists to explore some of these use cases!

General goals:
and who knows what else??
As of v0.4, the betterplot plugin provides:
manysearch;
pairwise output into a similarity matrix;
cluster output into color categories;

pip install sourmash_plugin_betterplot

See the examples below for some example command lines and output,
and use command-line help (-h/--help) to see available options.

labels-to CSV file

The labels-to CSV file taken by most (all?) of the comparison matrix
plotting functions (e.g. plot2, plot3, mds) is the same format output by
sourmash compare ... --labels-to <file> and loaded by
sourmash plot --labels-from <file>. The format is hopefully obvious, but
there are a few things to mention:

- sort_order column specifies the order of the columns with respect
  to the samples in the distance matrix. This is there to support arbitrary
  re-arranging and processing of the CSV file.
- label column is the name that will be displayed on the plot, as well as
  for the default "categories" CSV matching (see below). You can edit this
  by hand (spreadsheet, text editor) or programmatically.
- labels.txt file output by sourmash compare is entirely ignored ;).

One of the nice features of the betterplot functions is the ability to
provide categories that color the plots. This is critical for some
plots - for example, the mds and mds2 plots don't make much sense
without colors! - and nice for other plots, like plot3 and clustermap1,
where you can color columns/rows by category.

To make use of this feature, you need to provide a "categories" CSV
file (typically -C/--categories-csv). This file is reasonably flexible
in format; it must contain at least two columns, one named category,
but can contain more as long as category is provided.

The simplest possible categories CSV format is shown in
10sketches-categories.csv, and it contains two columns, label and
category. When this file is loaded, label is matched to the name of
each point/row/column, and that point is then assigned that category.
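
The label matching described above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's own code: it assumes only the sort_order and label columns mentioned earlier (a real labels-to file may contain more), and the sample labels, group names, and the "unassigned" default are made up for the example.

```python
import csv
import io

# Stand-in for a labels-to CSV. Only the two columns described in the text
# are used; the labels themselves are hypothetical.
LABELS_TO = """sort_order,label
1,genome_B
0,genome_A
2,genome_C
"""


def read_labels(handle):
    """Return labels ordered by sort_order, i.e. in matrix order."""
    rows = list(csv.DictReader(handle))
    rows.sort(key=lambda row: int(row["sort_order"]))
    return [row["label"] for row in rows]


def write_categories(labels, label_to_category, handle, default="unassigned"):
    """Write a two-column categories CSV with a header of label,category."""
    writer = csv.DictWriter(handle, fieldnames=["label", "category"])
    writer.writeheader()
    for label in labels:
        category = label_to_category.get(label, default)
        writer.writerow({"label": label, "category": category})


labels = read_labels(io.StringIO(LABELS_TO))

out = io.StringIO()
write_categories(labels, {"genome_A": "group1", "genome_B": "group1"}, out)
categories_csv = out.getvalue()
print(categories_csv)
```

Writing the result to a file instead of a StringIO gives something that could be passed via -C, provided the label values match the names shown on the plot.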
Additional flexibility is provided by the column matching.
Some restrictions of / observations on the current implementation:
category column. This won't be picked up by the
code automatically - you'll need to specify the same file via -C -
but it works fine!

The command lines below are executable in the examples/ subdirectory
of the repository after installing the plugin.

plot2 - basic 3 sketches example

Compare 3 sketches with sourmash compare, and cluster.

These commands:
sourmash compare sketches/{2,47,63}.sig.zip -o 3sketches.cmp \
--labels-to 3sketches.cmp.labels_to.csv
sourmash scripts plot2 3sketches.cmp 3sketches.cmp.labels_to.csv \
-o examples/plot2.3sketches.cmp.png
produce this plot:

plot2 - 3 sketches example with a cut line: plot2 --cut-point 1.2

Compare 3 sketches with sourmash compare, cluster, and show a cut point.

These commands:
sourmash compare sketches/{2,47,63}.sig.zip -o 3sketches.cmp \
--labels-to 3sketches.cmp.labels_to.csv
sourmash scripts plot2 3sketches.cmp 3sketches.cmp.labels_to.csv \
-o examples/plot2.cut.3sketches.cmp.png \
--cut-point=1.2
produce this plot:

plot2 - dendrogram of 10 sketches with a cut line + cluster extraction

Compare 10 sketches with sourmash compare, cluster, and use a cut
point to extract multiple clusters. Use --dendrogram-only to plot
just the dendrogram.

These commands:
sourmash compare sketches/{2,47,48,49,51,52,53,59,60}.sig.zip \
-o 10sketches.cmp \
--labels-to 10sketches.cmp.labels_to.csv
sourmash scripts plot2 10sketches.cmp 10sketches.cmp.labels_to.csv \
-o plot2.cut.dendro.10sketches.cmp.png \
--cut-point=1.35 --cluster-out --dendrogram-only
produce this plot:
as well as a set of 6 clusters written to 10sketches.cmp.*.csv.

mds - multidimensional scaling (MDS) plot from sourmash compare output

Use MDS to display a comparison generated by sourmash compare.

These commands:
sourmash compare sketches/{2,47,48,49,51,52,53,59,60}.sig.zip \
-o 10sketches.cmp \
--labels-to 10sketches.cmp.labels_to.csv
sourmash scripts mds 10sketches.cmp 10sketches.cmp.labels_to.csv \
-o mds.10sketches.cmp.png \
-C 10sketches-categories.csv
produce this plot:

mds2 - multidimensional scaling (MDS) plot from pairwise output

Use MDS to display a sparse comparison created using the
branchwater plugin's pairwise command. The output of pairwise is
distinct from the sourmash compare output: pairwise produces a sparse
CSV file that contains just the matches above threshold, while
sourmash compare produces a dense numpy matrix.

These commands:
sourmash sig cat sketches/{2,47,48,49,51,52,53,59,60}.sig.zip \
-o 10sketches.sig.zip
sourmash scripts pairwise 10sketches.sig.zip -o 10sketches.pairwise.csv
sourmash scripts mds2 10sketches.pairwise.csv \
-o mds.10sketches.cmp.png \
-C 10sketches-categories.csv
produce this plot:

cluster_to_categories - convert clusters from cluster into categories

The sourmash scripts cluster command from the branchwater plugin
will cluster pairwise output; cluster_to_categories converts these
clusters into a categories CSV that can be used to color points and
columns/rows.

These commands:
# generate pairwise comparison
sourmash scripts pairwise 64sketches.sig.zip -o 64sketches.pairwise.csv \
--write-all
# generate clusters
sourmash scripts cluster 64sketches.pairwise.csv \
-o 64sketches.pairwise.clusters.csv \
--similarity jaccard -t 0
# convert to categories CSV
sourmash scripts cluster_to_categories 64sketches.pairwise.csv \
64sketches.pairwise.clusters.csv -o 64sketches.pairwise.clusters.cats.csv
produce 64sketches.pairwise.clusters.cats.csv, which categorizes the
input samples based on their cluster membership.
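
As a rough sketch of what this conversion involves: the cluster file layout below (a cluster column plus a nodes column with members separated by ";") is an assumption about the branchwater output, not something stated above, and the sample names are made up. The real cluster_to_categories command also takes the pairwise CSV as input, which this sketch ignores.

```python
import csv
import io

# Hypothetical cluster file: one row per cluster, members separated by ';'.
# The column names and separator are assumptions made for illustration.
CLUSTERS = """cluster,nodes
Component_1,genome_A;genome_B
Component_2,genome_C
"""


def clusters_to_categories(handle, sep=";"):
    """Map each member label to the name of the cluster that contains it."""
    label_to_category = {}
    for row in csv.DictReader(handle):
        for label in row["nodes"].split(sep):
            label_to_category[label.strip()] = row["cluster"]
    return label_to_category


cats = clusters_to_categories(io.StringIO(CLUSTERS))

# Emit the two-column label,category layout used by the categories CSV.
print("label,category")
for label, category in sorted(cats.items()):
    print(f"{label},{category}")
```

Each sample ends up with exactly one category, named after its cluster, so every cluster can be drawn in its own color.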

tsne - tSNE plot of comparisons from sourmash compare output

t-distributed stochastic neighbor embedding (t-SNE) is another method
for visualizing high-dimensional data in two dimensions. The tsne
command displays a comparison generated by sourmash compare.

These commands:
sourmash compare 64sketches.sig.zip -o 64sketches.cmp \
--labels-to 64sketches.cmp.labels_to.csv
sourmash scripts tsne 64sketches.cmp 64sketches.cmp.labels_to.csv \
-C 64sketches.pairwise.clusters.cats.csv -o tsne.64sketches.cmp.png
produce this plot:
(The 64sketches.pairwise.clusters.cats.csv file is generated by the
cluster_to_categories command above.)

tsne2 - tSNE plot of comparisons from pairwise output

These commands:
sourmash scripts pairwise 64sketches.sig.zip -o 64sketches.pairwise.csv \
--write-all
sourmash scripts tsne2 64sketches.pairwise.csv \
-C 64sketches.pairwise.clusters.cats.csv -o tsne2.64sketches.cmp.png
produce this plot:
(The 64sketches.pairwise.clusters.cats.csv file is generated by the
cluster_to_categories command above.)

pairwise_to_matrix - convert pairwise output to sourmash compare output and plot

Convert the sparse comparison CSV (created using the
branchwater plugin's pairwise command) into a sourmash compare-style
similarity matrix.
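
The idea behind the conversion can be sketched in plain Python. This is an illustration under assumptions, not the plugin's implementation: the column names query_name, match_name, and jaccard are assumed for the pairwise CSV, the sample names are invented, and a list of lists stands in for the numpy matrix.

```python
import csv
import io

# Hypothetical sparse pairwise output: only pairs above threshold appear.
PAIRWISE = """query_name,match_name,jaccard
genome_A,genome_B,0.5
genome_A,genome_C,0.1
"""


def pairwise_to_dense(handle, value_column="jaccard"):
    """Expand sparse pair rows into a symmetric, dense similarity matrix."""
    rows = list(csv.DictReader(handle))
    names = {r["query_name"] for r in rows} | {r["match_name"] for r in rows}
    labels = sorted(names)
    index = {label: i for i, label in enumerate(labels)}

    n = len(labels)
    matrix = [[0.0] * n for _ in range(n)]  # missing pairs stay at 0.0
    for i in range(n):
        matrix[i][i] = 1.0  # each sample is identical to itself

    for r in rows:
        i, j = index[r["query_name"]], index[r["match_name"]]
        value = float(r[value_column])
        matrix[i][j] = value
        matrix[j][i] = value  # Jaccard is symmetric, so mirror the entry
    return labels, matrix


dense_labels, dense = pairwise_to_dense(io.StringIO(PAIRWISE))
for label, row in zip(dense_labels, dense):
    print(label, row)
```

Pairs missing from the sparse file are left at 0.0 here, so any similarity below the pairwise threshold is lost in the dense matrix. Mirroring entries is only appropriate for symmetric measures such as Jaccard.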
These commands:
# build pairwise
sourmash sig cat sketches/{2,47,48,49,51,52,53,59,60}.sig.zip \
-o 10sketches.sig.zip
sourmash scripts pairwise 10sketches.sig.zip -o 10sketches.pairwise.csv
# convert pairwise
sourmash scripts pairwise_to_matrix 10sketches.pairwise.csv \
-o 10sketches.pairwise.cmp --write-all \
--labels-to 10sketches.pairwise.cmp.labels_to.csv
# plot!
sourmash scripts plot2 10sketches.pairwise.cmp \
10sketches.pairwise.cmp.labels_to.csv \
-o plot2.pairwise.10sketches.cmp.png
produce this plot:

plot3 - seaborn clustermap with color categories

Plot a sourmash compare similarity matrix using the seaborn
clustermap, which offers some nice visualization options.

These commands:
sourmash compare sketches/{2,47,48,49,51,52,53,59,60}.sig.zip \
-o 10sketches.cmp \
--labels-to 10sketches.cmp.labels_to.csv
sourmash scripts plot3 10sketches.cmp 10sketches.cmp.labels_to.csv \
-o plot3.10sketches.cmp.png -C 10sketches-categories.csv
produce this plot:

clustermap1 - seaborn clustermap for non-symmetric matrices

Plot the sparse comparison CSV (created using the
branchwater plugin's manysearch command) using seaborn's clustermap.
Supports separate category coloring on rows and columns.
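
To see why the matrix is non-symmetric here: queries and matches are different sets of samples, so rows and columns carry different labels. The sketch below assumes the manysearch CSV has query_name, match_name, and containment columns (an assumption, as are the sample names), and places queries on rows, which is a choice made for this example rather than a statement about clustermap1.

```python
import csv
import io

# Hypothetical manysearch output: queries and matches are different samples.
MANYSEARCH = """query_name,match_name,containment
genome_A,ref_1,0.9
genome_B,ref_1,0.2
genome_B,ref_2,0.7
"""


def manysearch_to_table(handle, value_column="containment"):
    """Build a rectangular table: one row per query, one column per match."""
    records = list(csv.DictReader(handle))
    row_labels = sorted({r["query_name"] for r in records})
    col_labels = sorted({r["match_name"] for r in records})
    col_index = {label: j for j, label in enumerate(col_labels)}

    # Pairs that manysearch did not report stay at 0.0.
    table = {label: [0.0] * len(col_labels) for label in row_labels}
    for r in records:
        column = col_index[r["match_name"]]
        table[r["query_name"]][column] = float(r[value_column])
    return row_labels, col_labels, table


row_labels, col_labels, table = manysearch_to_table(io.StringIO(MANYSEARCH))
print("columns:", col_labels)
for label in row_labels:
    print(label, table[label])
```

Because row and column labels are separate lists, each axis can be matched against its own categories file.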
These commands:
sourmash sig cat sketches/{2,47,48,49,51,52,53,59,60}.sig.zip \
-o 10sketches.sig.zip
sourmash scripts manysearch 10sketches.sig.zip \
sketches/shew21.sig.zip -o 10sketches.manysearch.csv
sourmash scripts clustermap1 10sketches.manysearch.csv \
-o clustermap1.10sketches.png \
-u containment -R 10sketches-categories.csv
produce:

upset - plot sketch intersections using UpSetPlot

Plot an UpSetPlot of the intersections between sketches.
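
For intuition, the quantity an UpSet plot typically displays can be computed by hand: for every combination of sets, count the elements found in exactly that combination and no other. The sketch below does this for small made-up integer sets standing in for sketch hashes. Whether the upset command uses these exclusive counts is an assumption, and the function name is hypothetical.

```python
def exclusive_intersections(named_sets):
    """Count elements by the exact combination of sets that contain them."""
    # First record, for every element, which sets it belongs to.
    membership = {}
    for name, items in named_sets.items():
        for item in items:
            membership.setdefault(item, set()).add(name)

    # Then tally elements per distinct combination of set names.
    counts = {}
    for names in membership.values():
        key = tuple(sorted(names))
        counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up "hashes" for three hypothetical sketches.
sketches = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {4, 6},
}

upset_counts = exclusive_intersections(sketches)
for combo, size in sorted(upset_counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(" & ".join(combo), size)
```

Each element is counted once, so the counts over all combinations sum to the size of the union of the input sets.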
This command:
sourmash scripts upset 10sketches.sig.zip -o 10sketches.upset.png
produces:
We suggest filing issues in the main sourmash issue tracker as that receives more attention!

betterplot is developed at
https://github.com/sourmash-bio/sourmash_plugin_betterplot.

See environment.yml for the dependencies needed to develop betterplot.

Run:
make examples
to run the examples.
For now, the examples serve as the tests; eventually we will add unit tests.
Bump version number in pyproject.toml and push.
Make a new release on github.
Then pull, and:
python -m build
followed by twine upload dist/...

CTB June 2024