smartcorelib / smartcore

A comprehensive library for machine learning and numerical computing. The library provides a set of tools for linear algebra, numerical computing, optimization, and enables a generic, powerful yet still efficient approach to machine learning.
https://smartcorelib.org/
Apache License 2.0

Implement hierarchical clustering #11

Open Mec-iS opened 4 years ago

Mec-iS commented 4 years ago

Motivation: why do we need hierarchical clustering when we already have k-means?

Vocabulary:

Sub-tasks:

Visualisations: (?)

Other implementations:

VolodymyrOrlov commented 4 years ago

@Mec-iS Take a look at this paper, which describes the FastPair algorithm. It helps speed up the cluster merge operation. I also suggest taking a look at the fastcluster implementation of hierarchical clustering described in this paper. The figures at the bottom of this page show the difference between fastcluster and other implementations very well. Unfortunately, it is written in C++.

Mec-iS commented 4 years ago

Notes

FastPair

fastcluster:

Alternatives:

  1. translate the Python interface into Rust, then hunt for changes/improvements in the C++ version
  2. FFI to call C++ from Rust, in particular using rustcxx

Background questions:

VolodymyrOrlov commented 4 years ago

Background questions:

  • are we going for a 100% Rust native implementation?
  • are we supposed to allow the use of unsafe blocks or not?

Yes, calling a C++ library from SmartCore is not an option, for multiple reasons. We would have to ship fastcluster with SmartCore somehow, which diminishes the usefulness of our library.

Do you know C++ by any chance? 😄 If not, feel free to go with any implementation, even if it is not the fastest out there. Another option would be to try to implement the algorithm (it is described here) yourself. It would be super awesome if you could implement fastcluster in Rust, because then we would be the only Rust library that has it.

Mec-iS commented 4 years ago

What role did you have in mind for FastPair?

VolodymyrOrlov commented 4 years ago

What role did you have in mind for FastPair?

As an alternative to fastcluster.

Mec-iS commented 2 years ago

FastPair is implemented (#142).

We can move on to implementing clustering, starting with AgglomerativeClustering.

Tasks:

  1. parse parameters as in _fit() (we need only some of the required parameters)
  2. implement ward_tree
  3. return _labels

The basic linkage is Ward (which requires Euclidean distance).
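
To make the three tasks above concrete, here is a minimal sketch of agglomerative clustering with Ward linkage in plain Rust. It is not SmartCore's API: the function name `ward_labels`, its signature, and the `Vec<Vec<f64>>` input are assumptions made for illustration. The closest pair is found with a naive scan over all active clusters (the step FastPair is meant to speed up), and distances to a merged cluster are updated with the Lance-Williams formula for Ward on squared Euclidean distances.

```rust
/// Squared Euclidean distance between two points of equal dimension.
fn sq_euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical helper (not part of SmartCore): naive Ward-linkage
/// agglomerative clustering. Returns one label in `0..n_clusters` per sample,
/// numbered in order of first appearance.
pub fn ward_labels(data: &[Vec<f64>], n_clusters: usize) -> Vec<usize> {
    let n = data.len();
    assert!(
        n_clusters >= 1 && n_clusters <= n,
        "n_clusters must be between 1 and the number of samples"
    );

    // dist[i][j] holds the squared dissimilarity between cluster slots i and j.
    let mut dist = vec![vec![0.0f64; n]; n];
    for i in 0..n {
        for j in (i + 1)..n {
            let d = sq_euclidean(&data[i], &data[j]);
            dist[i][j] = d;
            dist[j][i] = d;
        }
    }

    let mut size = vec![1usize; n]; // number of samples in each slot
    let mut active = vec![true; n]; // false once a slot has been merged away
    let mut slot_of: Vec<usize> = (0..n).collect(); // sample -> cluster slot

    for _ in 0..(n - n_clusters) {
        // Naive closest-pair search over active slots (O(n^2) per merge).
        let mut best = (0usize, 0usize, f64::INFINITY);
        for i in 0..n {
            if !active[i] {
                continue;
            }
            for j in (i + 1)..n {
                if active[j] && dist[i][j] < best.2 {
                    best = (i, j, dist[i][j]);
                }
            }
        }
        let (i, j, dij) = best;

        // Lance-Williams update for Ward linkage: slot i becomes the merged cluster.
        let (ni, nj) = (size[i] as f64, size[j] as f64);
        for k in 0..n {
            if !active[k] || k == i || k == j {
                continue;
            }
            let nk = size[k] as f64;
            let d = ((ni + nk) * dist[i][k] + (nj + nk) * dist[j][k] - nk * dij)
                / (ni + nj + nk);
            dist[i][k] = d;
            dist[k][i] = d;
        }
        size[i] += size[j];
        active[j] = false;
        for s in slot_of.iter_mut() {
            if *s == j {
                *s = i;
            }
        }
    }

    // Map the surviving slots to consecutive labels.
    let mut slot_to_label = vec![usize::MAX; n];
    let mut next = 0;
    let mut labels = Vec::with_capacity(n);
    for &slot in &slot_of {
        if slot_to_label[slot] == usize::MAX {
            slot_to_label[slot] = next;
            next += 1;
        }
        labels.push(slot_to_label[slot]);
    }
    labels
}
```

For example, calling `ward_labels` with the four points (0, 0), (0, 1), (10, 10), (10, 11) and `n_clusters = 2` groups the first two and the last two samples, giving labels `[0, 0, 1, 1]`. The sketch runs in O(n^3) time and keeps a full O(n^2) distance matrix; a real implementation would presumably replace the pair search with FastPair and record the merge tree rather than only the final labels.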