Closed meren closed 8 years ago
The commit e308685a7c1a5a0157ca81112af7bd5bce687ba2 reimplements a critical function, so here is to make sure I didn't screw it up (cat test.py
):
import numpy as np
import anvio.clustering as c
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[20,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[20,])
mock_data = np.concatenate((a, b),)
tree = c.get_clustering_as_tree(mock_data)
newick = c.get_tree_object_in_newick(tree, None)
newick_old = c.get_tree_object_in_newick_obsolete(tree, None)
print 'old newick is identical to new:', newick == newick_old
Here is the result:
$ python test.py
old newick is identical to new: True
So far so good. I will keep the old function in the codebase for a while (which is now called get_tree_object_in_newick_obsolete
)
The commit 86f272c8008fc3a676fef148b272bf6bfee9934b reimplement the SciPy version of the hierarchical clustering. Just to make sure things did not get screwed up (cat test.py
):
import numpy as np
import anvio.clustering as c
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[100,])
mock_data = np.concatenate((a, b),)
# with hcluster
hc_tree = c.get_clustering_as_tree_obsolete(mock_data)
hc_newick = c.get_tree_object_in_newick(hc_tree, None)
# with scipy
sp_tree = c.get_clustering_as_tree(mock_data)
sp_newick = c.get_tree_object_in_newick(sp_tree, None)
print 'newick from hcluster is identical to the one from SciPy:', hc_newick == sp_newick
result:
$ python test.py
newick from hcluster is identical to the one from SciPy: True
Good!
The only remaining thing is the performance at this point.
here is test_hc.py
:
import numpy as np
import anvio.clustering as c
np.random.seed(1)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[2500,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[2500,])
mock_data = np.concatenate((a, b),)
# with hcluster
hc_tree = c.get_clustering_as_tree_obsolete(mock_data)
hc_newick = c.get_tree_object_in_newick(hc_tree, None)
and here is test_sp.py
:
import numpy as np
import anvio.clustering as c
np.random.seed(1)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[2500,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[2500,])
mock_data = np.concatenate((a, b),)
# with scipy
sp_tree = c.get_clustering_as_tree(mock_data)
sp_newick = c.get_tree_object_in_newick(sp_tree, None)
performance of each:
$ time python test_hc.py
real 1m17.362s
user 1m16.493s
sys 0m0.498s
meren ~ $ time python test_sp.py
real 0m52.174s
user 0m51.358s
sys 0m0.453s
Tried both multiple times with smaller data, it seems numbers make sense.
Not only we got rid of an archaic dependency, but also there is about 30% increase in runtime performance!
We want to get rid of hcluster dependency for a number of reasons. One of the most important reason is the fact that it has not been developed for a number of years, and it stops us from switching to Python 3.x (see #343).
Fortunately SciPy has everything we need. The only problem is to make sure we are backwards compatible with our trees, and more importantly the performance is somewhat comparable.
I will refer to this issue from those commits as I go through the code and make relevant changes towards retiring hcluster.