kgori / treeCl

Clustering phylogenetic trees with python
MIT License
24 stars 12 forks source link

AssertionError from Spectral c'tor #10

Closed hdetering closed 6 years ago

hdetering commented 7 years ago

Hi kgoryi,

when initializing a treeCl.Spectral object, I get an assertion error:

/home/harry/.local/lib/python2.7/site-packages/treeCl/clustering.pyc in __init__(self, dm, pruning_option, scale_option, manual_pruning, manual_scale, verbosity)
    154         self._manual_scale = manual_scale
    155         self._verbosity = verbosity
--> 156         self._affinity = self.decompose()
    157 
    158     def __str__(self):

/home/harry/.local/lib/python2.7/site-packages/treeCl/clustering.pyc in decompose(self, noise, verbosity, logic, **kwargs)
    215                 print('Rescaling to avoid zero-div error')
    216             _, _, scale = binsearch_mask(matrix, logic=logic)
--> 217             assert (scale > 1e-5).all()
    218 
    219         aff = affinity(matrix, mask, scale)

AssertionError: 

It seems that the rescaling of the distance matrix (which contains zeros) is not producing the expected result. I'll attach you a pickle'd version of the distance matrix used: example.dm.pkl.zip Distances were calculated using the 'geo' method.

A minimal working example to reproduce the error would be:

import treeCl, pickle

with open('example.dm.pkl', 'rb') as f:
  dm = pickle.load(f)

spclust = treeCl.Spectral(dm)

What seems to be the problem here? Are the input trees to similar (how could this be bad)? How can the problem be avoided?

Thanks a lot and cheers -- Harry

kgori commented 7 years ago

Hi Harry, I looked at your data and it seems there's a bug in treeCl. The Spectral object is supposed to automatically adjust a density parameter to fit your distance matrix, to improve clustering and protect against divide-by-zero errors. With your data this automatic adjustment isn't working - looks like the wrong binary search routine is being called.

There is a workaround. If you construct the Spectral object like this:

spclust = treeCl.Spectral(scale_option=treeCl.clustering.options.LOCAL_SCALE_MANUAL,
                          manual_scale=245)

you get a working object. If you're wondering where the 245 comes from, it's the result of calling the correct binary search routine.

Hope this helps for now. I'll work on making a proper fix for the next release.

hdetering commented 7 years ago

Thanks for your reply, kgori!

Your workaround works for the distance matrix I uploaded in the initial post but seems to be specific to that data. For other matrices (e.g. example1.dm.pkl.zip) I still get the same error. But you probably knew this already.

Since I have some more data sets to run I will go with the MDS clustering which works without problems (according to your paper they perform equally well).

Anyway, thanks a lot for your effort! -- Harry

alexweisberg commented 7 years ago

Hi kgori,

I also get this same error with my dataset (using geodesic distances and Spectral clustering). This dataset has a lot of missing data and a few potentially very incongruent tree topologies.

I have also found that when I do MDS clustering on this dataset I get a similar result to your manuscript's Supplementary Figure 12, ie two groups strongly separated by PC1 in a vertical line. This clustering found by MDS is not consistent with the partitions found using Ward's method (I cannot test Spectral since it gives the same assert error as Harry).

Any suggestions?

best, Alex

rohanmaddamsetti commented 7 years ago

Hi Dr. Gori,

I also get this same assertion error with my data, using geodesic distances and spectral clustering. I have ~4000 gene trees, so unfortunately I cannot attached a zipped pickle of my distance matrix here.

Cheers, Rohan