Open Quicken-up opened 7 years ago
I believe @taylorreiter has run into this with large data sets. There's no good simple solution AFAIK; the scipy.cluster.hierarchy code just doesn't like so many samples! A workaround might be to write out the comparison matrix with the --csv option for 'compare' in the latest sourmash master and import it into a program that handles large clusters better.
(This problem is serious and is part of the motivation behind issues:
https://github.com/dib-lab/sourmash/issues/256
https://github.com/dib-lab/sourmash/issues/225
but we don't have a solution yet!)
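A minimal Python sketch of the CSV route, for anyone who wants to inspect the matrix before handing it to another tool. It assumes the file written by --csv has one header row of sample labels followed by one row of similarity values per sample; check that against your own file. The names load_compare_csv and to_distances are made up for illustration and are not part of sourmash.

```python
import csv
import io


def load_compare_csv(fileobj):
    """Parse a square matrix: one header row of labels, then one row of floats per sample.

    The layout is an assumption about the exported file, not a documented guarantee.
    """
    reader = csv.reader(fileobj)
    labels = next(reader)
    matrix = [[float(x) for x in row] for row in reader if row]
    if len(matrix) != len(labels) or any(len(r) != len(labels) for r in matrix):
        raise ValueError("expected a square matrix matching the header labels")
    return labels, matrix


def to_distances(matrix):
    """Turn similarities in [0, 1] into distances (1 - similarity)."""
    return [[1.0 - v for v in row] for row in matrix]


# Tiny in-memory stand-in for a real exported file.
SAMPLE = "a,b\n1.0,0.25\n0.25,1.0\n"
labels, matrix = load_compare_csv(io.StringIO(SAMPLE))
distances = to_distances(matrix)
```

With a real file you would pass open("your_file.csv", newline="") instead of the StringIO object; the resulting distance matrix can then go to whatever clustering tool copes with your sample count.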
From @luizirber, when I ran into this problem:
you can change the recursion depth limit with sys.setrecursionlimit: https://docs.python.org/3/library/sys.html#sys.setrecursionlimit
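A rough sketch of how that could look, assuming you are calling the plotting code from your own Python script rather than through the sourmash command line. The helper call_with_recursion_limit and the toy function depth are hypothetical names used only for illustration; depth just stands in for a deeply recursive call such as the dendrogram traversal.

```python
import sys


def call_with_recursion_limit(func, *args, limit=10000, **kwargs):
    """Call func with a temporarily raised recursion limit, then restore the old one."""
    old_limit = sys.getrecursionlimit()
    sys.setrecursionlimit(max(old_limit, limit))
    try:
        return func(*args, **kwargs)
    finally:
        sys.setrecursionlimit(old_limit)


def depth(n):
    """Toy recursive function: recurses n levels deep and returns n."""
    return 0 if n == 0 else 1 + depth(n - 1)
```

In a script, the wrapped call would be something like call_with_recursion_limit(sch.dendrogram, Y, orientation='right', labels=labeltext), taking the call from the traceback and assuming sch is scipy.cluster.hierarchy. Be careful with very large limits: the Python docs warn that a limit set too high can overflow the C stack and crash the interpreter, so this is a workaround rather than a fix.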
I did what @ctb suggested and output the matrix as a csv and used R to make a dendrogram without the heatmap.
Quick and dirty R code:
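# Install once; fastcluster provides a faster drop-in hclust()
install.packages("fastcluster")
library(fastcluster)
# Read the matrix written by 'sourmash compare --csv'
compk4 <- read.csv("Oe6_scaffolds_k4.comp.csv")
rownames(compk4) <- colnames(compk4)
# Centroid linkage (computed here but not used in the plot below)
cluster_compk4 <- hclust(dist(compk4), "cen")
# Default (complete) linkage, used for the dendrogram
compk4_clusters <- hclust(dist(compk4))
dend <- as.dendrogram(compk4_clusters)
plot(dend)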
Maybe it's time to add a dependency on fastcluster and change the plot code to use it? (It's the same library @taylorreiter suggested in her R solution)
sounds like something worth exploring for sure! concerned about adding more dependencies tho.
Thanks for the suggestions. I will try the CSV option, and also try running compare on a reduced subset of good bins. Should it run OK on 200 bins? 500? Is there a good reason to use recursion in the code rather than iteration? That seems to be the root of the problem.
We're using the scipy.cluster.hierarchy package as a black box, so you'd have to ask them why recursion :). I've personally plotted 300x300 on my laptop without any trouble.
The question of what approach to include in the sourmash package itself is an interesting one - so far we've chosen something that is straightforward to install and community supported, but we haven't put a lot of thought into it (or at least I haven't). Your experience is valuable in suggesting that we choose something else more scalable as a default.
But hey, at least we finally let you export CSV!
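On the recursion-versus-iteration question above: the sketch below walks a SciPy-style linkage matrix with an explicit stack instead of recursive calls, so the depth of the tree never touches Python's recursion limit. It is an illustration of the general technique, not how scipy.cluster.hierarchy is implemented. The functions leaf_order and chain_linkage are hypothetical, and the row layout (left id, right id, distance, count, with merged clusters numbered from n upward) follows the usual linkage-matrix convention.

```python
def chain_linkage(n):
    """Build a fully unbalanced ("chained") linkage for n >= 2 leaves.

    Row i merges the cluster made by row i-1 with leaf i+1, so the tree has
    depth n-1, the worst case for a recursive traversal.
    """
    rows = [(0, 1, 1.0, 2)]
    for i in range(1, n - 1):
        rows.append((n + i - 1, i + 1, float(i + 1), i + 2))
    return rows


def leaf_order(Z, n):
    """Return leaf ids left to right, using an explicit stack instead of recursion."""
    order = []
    stack = [2 * n - 2]  # id of the root cluster (the last row of Z)
    while stack:
        node = stack.pop()
        if node < n:
            order.append(node)  # ids below n are original samples
        else:
            left, right = Z[node - n][0], Z[node - n][1]
            # Push right first so the left subtree is visited first.
            stack.append(int(right))
            stack.append(int(left))
    return order
```

For example, leaf_order(chain_linkage(5000), 5000) visits all 5000 leaves even though the tree is 4999 levels deep, well past Python's default recursion limit of 1000. That is also why a few thousand samples can trip the limit when the clustering happens to be very unbalanced, while a few hundred usually plot fine.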
I have run sourmash compare on 2683 signature files each corresponding to a single bin from a large metagenomic dataset. When I then try to plot the output using sourmash plot --labels cmp, I get the error below. Any suggestions on fixing this?
Traceback (most recent call last):
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash==2.0.0a1', 'console_scripts', 'sourmash')()
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/sourmash_lib/main.py", line 60, in main
    cmd(sys.argv[2:])
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/sourmash_lib/commands.py", line 395, in plot
    Z1 = sch.dendrogram(Y, orientation='right', labels=labeltext)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2365, in dendrogram
    above_threshold_color=above_threshold_color)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2651, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  <Line 2651 error repeated many times>
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2618, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2530, in _dendrogram_calculate_info
    leaf_label_func, i, labels)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2387, in _append_singleton_leaf_node
    lvs.append(int(i))
RecursionError: maximum recursion depth exceeded while calling a Python object