scikit-learn-contrib / scikit-learn-extra

scikit-learn contrib estimators
https://scikit-learn-extra.readthedocs.io
BSD 3-Clause "New" or "Revised" License
185 stars 42 forks source link

Contraction Clustering (RASTER): A very fast and parallelizable clustering algorithm #170

Open gregorulm opened 6 months ago

gregorulm commented 6 months ago

Copied from: https://github.com/scikit-learn/scikit-learn/issues/27848 In the comments, it was suggested that scikit-learn-extras may be a better fit for this algorithm.

Describe the workflow you want to enable

RASTER is a very fast clustering algorithm that runs in linear time, uses constant memory, and only requires a single pass. The relevant package is cluster.

Describe your proposed solution

RASTER has been shown to be faster than all other clustering algorithms that are part of the cluster package (see comparative results in the "alternatives" field). A detailed description of the algorithm is in this paper. The key idea is that data points are projected onto a grid. This helper data structure that allows us to cluster data points at the desired level of precision and at a speed much faster than any other clustering algorithm we encountered in the literature. The closest comparison we were made aware of was CLIQUE, but RASTER is more efficient and, in fact, many orders of magnitude faster, which we have also shown experimentally, see Appendix B in the paper above.

Plots with comparisons:

Screen Shot 2023-11-26 at 16 10 16

Example of adjusting the precision parameter:

Screen Shot 2023-11-26 at 16 12 47

Pseudo-code:

Screen Shot 2023-11-26 at 16 06 36

Implementation: https://github.com/FraunhoferChalmersCentre/raster/tree/master

The algorithm is furthermore parallelizable.

Describe alternatives you've considered, if relevant

We compare RASTER to 10 other clustering algorithms, and have found that it outperforms them. RASTER is not only faster, it is also able to process greater amounts of data, ceteris paribus. Here is a summary of the results of our research:

Screen Shot 2023-11-26 at 16 04 57

Additional context

RASTER was discovered in the context of an industrial research project "FUMA - Fleet telematics big data analytics for vehicle Usage Modeling and Analysis" (description, final report), which was administered by the Swedish research agency VINNOVA. It was a collaboration between Scania, a Volkswagen subsidiary, and the Fraunhofer-Chalmers Center for Industrial Mathematics. This research project ran from 2016 to 2019.

A practical result of RASTER was that it made it possible to process TBs of real-world geo-spatial data on a local workstation instead of a data center, leading to significant cost and time-savings. This also implies that we could eliminate security risks as we could keep this highly confidential dataset in-house. In fact, due to the single-pass nature of RASTER, we could generate results locally much faster than we could have gotten them via a data center.

TimotheeMathieu commented 6 months ago

Hello @gregorulm , thanks this seem interesting and could definitely be included into scikit-learn-extra. Could you do a Pull Request from a fork, with a link to this issue. For inclusion, you will need to add tests as well as examples and docstrings for your code, if you are upt for it you can do a PR and we can discuss there.

smruthi-sumanth commented 4 months ago

HI! New to OSS, and really think your proposal is worth adding to sklearn-extra. Can I get the go ahead to start working on this?

gregorulm commented 4 months ago

@smruthi-sumanth I intend to look at the code soon. If I really do not find the time, I will let you know.

gregorulm commented 3 months ago

@smruthi-sumanth Feel free to take over the implementation. You may find these resources helpful: 1) Reference implementations 2) Code artifacts including an integration with the scikit-learn clustering benchmark