merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
426 stars 145 forks source link

Retiring hcluster and removing the dependency for it #372

Closed meren closed 8 years ago

meren commented 8 years ago

We want to get rid of hcluster dependency for a number of reasons. One of the most important reason is the fact that it has not been developed for a number of years, and it stops us from switching to Python 3.x (see #343).

Fortunately SciPy has everything we need. The only problem is to make sure we are backwards compatible with our trees, and more importantly the performance is somewhat comparable.

I will refer to this issue from those commits as I go through the code and make relevant changes towards retiring hcluster.

meren commented 8 years ago

The commit e308685a7c1a5a0157ca81112af7bd5bce687ba2 reimplements a critical function, so here is to make sure I didn't screw it up (cat test.py):

import numpy as np
import anvio.clustering as c

a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[20,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[20,])
mock_data = np.concatenate((a, b),)

tree = c.get_clustering_as_tree(mock_data)

newick = c.get_tree_object_in_newick(tree, None)
newick_old = c.get_tree_object_in_newick_obsolete(tree, None)

print 'old newick is identical to new:', newick == newick_old

Here is the result:

$ python test.py
old newick is identical to new: True

So far so good. I will keep the old function in the codebase for a while (which is now called get_tree_object_in_newick_obsolete)

meren commented 8 years ago

The commit 86f272c8008fc3a676fef148b272bf6bfee9934b reimplement the SciPy version of the hierarchical clustering. Just to make sure things did not get screwed up (cat test.py):

import numpy as np
import anvio.clustering as c

a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[100,])
mock_data = np.concatenate((a, b),)

# with hcluster
hc_tree = c.get_clustering_as_tree_obsolete(mock_data)
hc_newick = c.get_tree_object_in_newick(hc_tree, None)

# with scipy
sp_tree = c.get_clustering_as_tree(mock_data)
sp_newick = c.get_tree_object_in_newick(sp_tree, None)

print 'newick from hcluster is identical to the one from SciPy:', hc_newick == sp_newick

result:

$ python test.py
newick from hcluster is identical to the one from SciPy: True

Good!

The only remaining thing is the performance at this point.

meren commented 8 years ago

here is test_hc.py:

import numpy as np
import anvio.clustering as c

np.random.seed(1)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[2500,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[2500,])
mock_data = np.concatenate((a, b),)

# with hcluster
hc_tree = c.get_clustering_as_tree_obsolete(mock_data)
hc_newick = c.get_tree_object_in_newick(hc_tree, None)

and here is test_sp.py:

import numpy as np
import anvio.clustering as c

np.random.seed(1)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[2500,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[2500,])
mock_data = np.concatenate((a, b),)

# with scipy
sp_tree = c.get_clustering_as_tree(mock_data)
sp_newick = c.get_tree_object_in_newick(sp_tree, None)

performance of each:

$ time python test_hc.py

real    1m17.362s
user    1m16.493s
sys 0m0.498s

meren ~ $ time python test_sp.py

real    0m52.174s
user    0m51.358s
sys 0m0.453s

Tried both multiple times with smaller data, it seems numbers make sense.

Not only we got rid of an archaic dependency, but also there is about 30% increase in runtime performance!