Open bibblybobblyben opened 11 months ago
Hi @bibblybobblyben, thanks for your interest and contribution! 🙂
You will also need to change how the centroids are determined. Currently, a mean of all the data points in a cluster is calculated to get the centroid, but this only makes sense if you are considering Euclidean distances.
The centroids are initially calculated here: https://github.com/joshlk/k-means-constrained/blob/65e22120d37e7c878d91d7fc73cb6380d581ec4f/k_means_constrained/k_means_constrained_.py#L303
Please also add some tests.
Looking forward to merging this feature 🤩 !
Josh
Thanks for the comments and feedback - will work to implement them 🙂
Good point on the centroids - do you think the best course of action is editing the _k_init
function from k_means_constrained.sklearn_import.cluster.k_means_
so that it can also take a custom distance metric? Other keyword arguments in init_centroids
don't seem affected by the change of distance metric?
For the changes involving _centers_dense
and _centers_sparse
, do I edit the functions in _k_means.pyx
? I am not familiar with using cython.
Can we discuss the motivation to edit those functions please? As I understand it - it looks at all samples that have been pre-allocated to a cluster number, and then calculates the mean coordinate for each point within that cluster. As they are already allocated by the _labels_constrained
function, which will incorporate the distance metric, is this calculation already implicitly using the metric? Or do you think we need to do something smarter to define the centre of each cluster? Something to minimise the mean distance to every point in that cluster?
The current version of k-means-constrained uses the Euclidean distance to calculate the distance between points in the dataset. Issue #17 highlighted the potential to make this a customisable parameter, and this pull requests introduces the functionality following the advice suggested in that issue.
What's new?
euclidean_distances
is no longer hard-coded into the clustering calculationsdistance_metric
argument which can be passed to the top-level clustering class and propagates through to the optimisation routineeuclidean_distance
function but enables users to change this to any other function of their choosing