sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

use alternative clustering package in sourmash plot, to support larger data sets #274

Open Quicken-up opened 7 years ago

Quicken-up commented 7 years ago

I have run sourmash compare on 2683 signature files each corresponding to a single bin from a large metagenomic dataset. When I then try to plot the output using sourmash plot --labels cmp, I get the error below. Any suggestions on fixing this?

Traceback (most recent call last):
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash==2.0.0a1', 'console_scripts', 'sourmash')()
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/sourmash_lib/main.py", line 60, in main
    cmd(sys.argv[2:])
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/sourmash_lib/commands.py", line 395, in plot
    Z1 = sch.dendrogram(Y, orientation='right', labels=labeltext)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2365, in dendrogram
    above_threshold_color=above_threshold_color)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2651, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)

<Line 2651 error repeated many times>

  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2618, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2530, in _dendrogram_calculate_info
    leaf_label_func, i, labels)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2387, in _append_singleton_leaf_node
    lvs.append(int(i))
RecursionError: maximum recursion depth exceeded while calling a Python object

ctb commented 7 years ago

I believe @taylorreiter has run into this with large data sets. There's no good simple solution AFAIK; the scipy cluster hierarchy code just doesn't like so many samples! One workaround is to write the comparison matrix to a file with the --csv option of 'compare' (available in the latest sourmash master) and import it into a program that handles large clusterings better.

(This problem is serious and is part of the motivation behind issues:

https://github.com/dib-lab/sourmash/issues/256

https://github.com/dib-lab/sourmash/issues/225

but we don't have a solution yet!)
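
As a minimal sketch of the CSV route suggested above, the following reads an exported matrix using only the Python standard library. It assumes the file holds one header row of labels followed by a square block of numeric rows; that layout and the helper name load_compare_csv are assumptions for illustration, not confirmed sourmash behavior.

```python
import csv
import io


def load_compare_csv(handle):
    """Read labels and a square matrix from an open CSV file (hypothetical helper).

    Assumed layout: first row = labels, remaining rows = numeric values.
    """
    reader = csv.reader(handle)
    labels = next(reader)
    # Skip blank lines and convert every remaining cell to a float.
    matrix = [[float(cell) for cell in row] for row in reader if row]
    if len(matrix) != len(labels) or any(len(row) != len(labels) for row in matrix):
        raise ValueError("expected a square matrix matching the header labels")
    return labels, matrix


if __name__ == "__main__":
    # Tiny in-memory stand-in for an exported file.
    example = io.StringIO("binA,binB\n1.0,0.25\n0.25,1.0\n")
    labels, matrix = load_compare_csv(example)
    print(labels, matrix)
```

With a real file, open it with open(path, newline="") and pass the handle in; the resulting labels and matrix can then be handed to whichever clustering tool copes with the data set size.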

taylorreiter commented 7 years ago

From @luizirber, when I ran into this problem: you can change the recursion depth limit with sys.setrecursionlimit: https://docs.python.org/3/library/sys.html#sys.setrecursionlimit
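
A minimal sketch of that workaround, assuming you control the Python process that does the plotting. The helper recursion_limit and the toy functions make_chain and chain_depth are hypothetical names used only to illustrate the effect; they are not part of sourmash or scipy. Raising the limit too far can crash the interpreter instead of raising RecursionError, so treat the value as something to tune.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the interpreter's recursion limit (hypothetical helper)."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        # Restore the old limit once the deep call has returned.
        sys.setrecursionlimit(previous)


def make_chain(length):
    """Build a nested (value, rest) chain, standing in for a very deep tree."""
    chain = None
    for value in range(length):
        chain = (value, chain)
    return chain


def chain_depth(node):
    """Recursive walk; one Python frame per level, like a recursive tree traversal."""
    return 0 if node is None else 1 + chain_depth(node[1])


if __name__ == "__main__":
    deep = make_chain(3000)
    # With the default limit (typically 1000) this call raises RecursionError.
    with recursion_limit(10000):
        print(chain_depth(deep))
```

In practice the call inside the with block would be whatever code triggers the deep recursion.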

I did what @ctb suggested and output the matrix as a csv and used R to make a dendrogram without the heatmap.

Quick and dirty R code:

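# Install once; fastcluster supplies a faster drop-in replacement for hclust()
install.packages("fastcluster")
library(fastcluster)
# read.csv treats the first row as column names by default
compk4 <- read.csv("Oe6_scaffolds_k4.comp.csv")
rownames(compk4) <- colnames(compk4)
# Centroid-linkage clustering (computed here but not used in the plot below)
cluster_compk4 <- hclust(dist(compk4), "cen")
# Default (complete-linkage) clustering; this is the one that gets plotted
compk4_clusters <- hclust(dist(compk4))
dend <- as.dendrogram(compk4_clusters)
plot(dend)

luizirber commented 7 years ago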

Maybe it's time to add a dependency on fastcluster and change the plot code to use it? (It's the same library @taylorreiter suggested in her R solution)

ctb commented 7 years ago

sounds like something worth exploring for sure! concerned about adding more dependencies tho.

Quicken-up commented 7 years ago

Thanks for the suggestions. I will try the csv option, and also try running compare on a reduced subset of good bins. Should it run OK on 200 bins? 500? Is there a good reason to use recursion in the code rather than iteration? That seems to be the root of the problem.
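
On the recursion question: a tree walk can generally be written with an explicit stack, which sidesteps the interpreter's recursion limit. The sketch below is not scipy's code. It assumes a linkage table in the layout scipy documents for its linkage output (with n leaves, row i merges clusters a and b into a new cluster numbered n + i), and leaf_order and chain_linkage are hypothetical names.

```python
def leaf_order(linkage, n_leaves):
    """Return leaf indices left to right using an explicit stack, not recursion."""
    root = n_leaves + len(linkage) - 1
    stack = [root]
    order = []
    while stack:
        node = stack.pop()
        if node < n_leaves:
            # Indices below n_leaves are original observations.
            order.append(node)
        else:
            left, right = linkage[node - n_leaves][:2]
            # Push right first so the left child is visited first.
            stack.append(int(right))
            stack.append(int(left))
    return order


def chain_linkage(n_leaves):
    """Build a worst-case 'caterpillar' tree: each merge adds one more leaf."""
    rows = [(0, 1, 1.0, 2)]
    for i in range(1, n_leaves - 1):
        rows.append((n_leaves + i - 1, i + 1, float(i + 1), i + 2))
    return rows


if __name__ == "__main__":
    n = 5000  # deeper than the default recursion limit
    print(leaf_order(chain_linkage(n), n)[:5])
```

The chained tree is the shape that hurts a recursive traversal most, since its depth grows with the number of samples; the loop above handles it with memory proportional to the tree size and no change to the recursion limit.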

ctb commented 7 years ago

We're using the scipy.cluster.hierarchy package as a black box, so you'd have to ask them why recursion :). I've personally plotted 300x300 on my laptop without any trouble.

The question of what approach to include in the sourmash package itself is an interesting one - so far we've chosen something that is straightforward to install and community supported, but we haven't put a lot of thought into it (or at least I haven't). Your experience is valuable in suggesting that we choose something else more scalable as a default.

But hey, at least we finally let you export CSV!