Open m-dz opened 6 years ago
It would certainly be beneficial. The code grew quite a lot in the course of development and has not had a proper overhaul in quite some time. I would prefer a number of smaller refactoring PR requests as I think I will find that easier to review, but I would greatly appreciate any work.
Wow, have tried to refactor KDTreeBoruvkaAlgorithm
and BallTreeBoruvkaAlgorithm
into one common superclass and two subclasses, Cython is not nice when it comes to class inheritance etc. I will continue, maybe with some help of SO.
Also, related to #156
Yeah, Cython is not good with inheritance at all -- hence the original split in two almost identical classes. You can look at what Jake Vanderplas did with binary_tree.pxi
in sklearn's neighbors module for inspiration on how this sort of thing can be done.
Hi, as methods kdtree_min_rdist_dual
and balltree_min_dist_dual
seems to be the biggest difference, especially with different number of parameters, I have two "working" examples of inheritance I would suggest for the KD-Tree and Balltree implementation. Please see here https://github.com/m-dz/test_inheritance
Branch default_values
, using default values for two different versions of Function.evaluate()
(A
and B
) - where A
could represent e.g. the kdtree_min_rdist_dual
and B
the balltree_min_dist_dual
. The biggest drawback is the evaluate
method of class FunctionA
does not raise any exception with the additional argument d
(tests test_A_raises_1
, 2 and 3).
Branch evaluate_overload
, where .pxd
files are used for a "proper" evaluate
method's overloading and a wrong number of arguments actually raises an exception. The drawback here is the extra .pxd
files which make the whole project slightly more complicated.
Do you have any preference here? 2 seems to be a "more correct" approach, but might not be necessary as long as the correct number of arguments is passed, which is a minor, "internal" drawback.
Also, do you have any examples of the performance/speed tests I can use to compare those two versions (and other PR's, like the weighting implementation)?
I think I would prefer option 2. It isn't that complicated to have some pxd files, and doing the right thing matters. As for performance tests -- in the notebooks directory there is a performance notebook comparing to other clustering algorithms. You can just run that (potentially commenting out most of the other algorithms).
Great, will try to "map" all methods to see what can be inherited. My initial guess is we should have 1 base class with two "branches", one for the tree structure (KD-Tree/Balltree) and one for MST algorithm (Prims/Boruvka).
Hi, I have noticed (as @lmcinnes mentioned once or twice) that there is "a bit" of code redundancy here and there, see e.g. definitions of
_hdbscan_boruvka_kdtree
and_hdbscan_boruvka_balltree
in https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/hdbscan_.py which differ in two places (out of ~20 lines of code):Would it be useful to PR some refactoring?