joshlk / k-means-constrained

K-Means clustering - constrained with minimum and maximum cluster size. Documentation: https://joshlk.github.io/k-means-constrained
https://github.com/joshlk/k-means-constrained
BSD 3-Clause "New" or "Revised" License
192 stars 43 forks source link

Allow top-level customisation of the distance metric used for clustering #52

Open bibblybobblyben opened 11 months ago

bibblybobblyben commented 11 months ago

The current version of k-means-constrained uses the Euclidean distance to calculate the distance between points in the dataset. Issue #17 highlighted the potential to make this a customisable parameter, and this pull requests introduces the functionality following the advice suggested in that issue.

What's new?

joshlk commented 11 months ago

Hi @bibblybobblyben, thanks for your interest and contribution! 🙂

You will also need to change how the centroids are determined. Currently, a mean of all the data points in a cluster is calculated to get the centroid, but this only makes sense if you are considering Euclidean distances.

The centroids are initially calculated here: https://github.com/joshlk/k-means-constrained/blob/65e22120d37e7c878d91d7fc73cb6380d581ec4f/k_means_constrained/k_means_constrained_.py#L303

And then here: https://github.com/joshlk/k-means-constrained/blob/65e22120d37e7c878d91d7fc73cb6380d581ec4f/k_means_constrained/k_means_constrained_.py#L340

Please also add some tests.

Looking forward to merging this feature 🤩 !

Josh

bibblybobblyben commented 11 months ago

Thanks for the comments and feedback - will work to implement them 🙂

Good point on the centroids - do you think the best course of action is editing the _k_init function from k_means_constrained.sklearn_import.cluster.k_means_ so that it can also take a custom distance metric? Other keyword arguments in init_centroids don't seem affected by the change of distance metric?

bibblybobblyben commented 11 months ago

For the changes involving _centers_dense and _centers_sparse, do I edit the functions in _k_means.pyx ? I am not familiar with using cython.

Can we discuss the motivation to edit those functions please? As I understand it - it looks at all samples that have been pre-allocated to a cluster number, and then calculates the mean coordinate for each point within that cluster. As they are already allocated by the _labels_constrained function, which will incorporate the distance metric, is this calculation already implicitly using the metric? Or do you think we need to do something smarter to define the centre of each cluster? Something to minimise the mean distance to every point in that cluster?