scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.77k stars 496 forks source link

Refactoring? #155

Open m-dz opened 6 years ago

m-dz commented 6 years ago

Hi, I have noticed (as @lmcinnes mentioned once or twice) that there is "a bit" of code redundancy here and there, see e.g. definitions of _hdbscan_boruvka_kdtree and _hdbscan_boruvka_balltree in https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/hdbscan_.py which differ in two places (out of ~20 lines of code):

    tree = KDTree(X, metric=metric, leaf_size=leaf_size, **kwargs)
    alg = KDTreeBoruvkaAlgorithm(tree, sample_weights, min_samples, metric=metric,
                                 leaf_size=leaf_size // 3,
                                 approx_min_span_tree=approx_min_span_tree,
                                 n_jobs=core_dist_n_jobs, **kwargs)
    tree = BallTree(X, metric=metric, leaf_size=leaf_size, **kwargs)
    alg = BallTreeBoruvkaAlgorithm(tree, sample_weights, min_samples, metric=metric,
                                   leaf_size=leaf_size // 3,
                                   approx_min_span_tree=approx_min_span_tree,
                                   n_jobs=core_dist_n_jobs, **kwargs)

Would it be useful to PR some refactoring?

lmcinnes commented 6 years ago

It would certainly be beneficial. The code grew quite a lot in the course of development and has not had a proper overhaul in quite some time. I would prefer a number of smaller refactoring PR requests as I think I will find that easier to review, but I would greatly appreciate any work.

m-dz commented 6 years ago

Wow, have tried to refactor KDTreeBoruvkaAlgorithm and BallTreeBoruvkaAlgorithm into one common superclass and two subclasses, Cython is not nice when it comes to class inheritance etc. I will continue, maybe with some help of SO.

Also, related to #156

lmcinnes commented 6 years ago

Yeah, Cython is not good with inheritance at all -- hence the original split in two almost identical classes. You can look at what Jake Vanderplas did with binary_tree.pxi in sklearn's neighbors module for inspiration on how this sort of thing can be done.

m-dz commented 6 years ago

Hi, as methods kdtree_min_rdist_dual and balltree_min_dist_dual seems to be the biggest difference, especially with different number of parameters, I have two "working" examples of inheritance I would suggest for the KD-Tree and Balltree implementation. Please see here https://github.com/m-dz/test_inheritance

  1. Branch default_values, using default values for two different versions of Function.evaluate() (A and B) - where A could represent e.g. the kdtree_min_rdist_dual and B the balltree_min_dist_dual. The biggest drawback is the evaluate method of class FunctionA does not raise any exception with the additional argument d (tests test_A_raises_1, 2 and 3).

  2. Branch evaluate_overload, where .pxd files are used for a "proper" evaluate method's overloading and a wrong number of arguments actually raises an exception. The drawback here is the extra .pxd files which make the whole project slightly more complicated.

Do you have any preference here? 2 seems to be a "more correct" approach, but might not be necessary as long as the correct number of arguments is passed, which is a minor, "internal" drawback.

Also, do you have any examples of the performance/speed tests I can use to compare those two versions (and other PR's, like the weighting implementation)?

lmcinnes commented 6 years ago

I think I would prefer option 2. It isn't that complicated to have some pxd files, and doing the right thing matters. As for performance tests -- in the notebooks directory there is a performance notebook comparing to other clustering algorithms. You can just run that (potentially commenting out most of the other algorithms).

m-dz commented 6 years ago

Great, will try to "map" all methods to see what can be inherited. My initial guess is we should have 1 base class with two "branches", one for the tree structure (KD-Tree/Balltree) and one for MST algorithm (Prims/Boruvka).