tjdwill / kmeans

An implementation of k-means clustering that maintains data association.
https://tjdwill.github.io/kmeans/
MIT License

Feature suggestion: Provide ability to specify which columns you cluster with. #2

Open tjdwill opened 5 months ago

tjdwill commented 5 months ago

Problem

Currently, the ndim parameter lets a user specify how many dimensions to cluster on. The implementation takes the first ndim elements of each data entry, i.e. the data is accessed as data[:, :ndim].

A user who wants to choose which columns to cluster on, however, has to reorder the data before calling the function.
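
For illustration, the reordering workaround might look like the sketch below. The array and the column choice are made up, and the cluster call is left as a comment because only its k and ndim parameters are known from this issue.

```python
import numpy as np

# Made-up data: 5 entries with 5 columns each.
data = np.arange(25, dtype=float).reshape(5, 5)

# Goal: cluster on columns 1, 2 and 4, but ndim=3 only reads columns 0..2.
wanted = [1, 2, 4]
rest = [c for c in range(data.shape[1]) if c not in wanted]

# Move the wanted columns to the front so that data[:, :3] picks them up.
reordered = data[:, wanted + rest]

# cluster(reordered, k=4, ndim=3)  # existing call style, shown for context only
```

The cost is that every returned entry now has its columns in a different order than the caller's original data, which the caller has to undo or keep track of.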

Proposed Solution

Have ndim accept either an int or a tuple of ints. The former says "cluster on the first ndim dimensions," and the latter says "cluster using these column indices." NumPy's indexing with a tuple of integers (advanced indexing) could then be leveraged:

ndim = (1, 2, 4)  # sorted internally during validation, because index order determines column order in NumPy indexing.
data[:, ndim]  # not :ndim
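
A small sketch of how this indexing behaves, using a made-up array rather than anything from the package. It also shows one difference from the slice form that may be worth keeping in mind: selecting columns with a tuple of integers returns a copy, whereas data[:, :ndim] returns a view.

```python
import numpy as np

data = np.arange(20).reshape(4, 5)   # made-up data: 4 entries, 5 columns
cols = (1, 2, 4)

selected = data[:, cols]             # advanced indexing with a tuple of ints
print(selected.shape)                # (4, 3)

# Order matters: the result's columns follow the order of the indices given.
swapped = data[:, (4, 1, 2)]
print(np.array_equal(swapped[:, 0], data[:, 4]))   # True

# The slice is a view of the original array; the tuple selection is a copy.
print(np.shares_memory(data, data[:, :3]))   # True
print(np.shares_memory(data, selected))      # False
```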

How would we handle the ndim: int case? Simple: generate a tuple of indices from the number:

cluster(data, k=4, ndim=3)
# Internally, ndim is an int, so convert it to a tuple of indices:
ndim = tuple(range(ndim))  # (0, 1, 2); data[:, ndim] now selects the same columns as data[:, :3].
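
The validation step could be sketched as a single helper. The name normalize_ndim and its n_cols argument are hypothetical and not part of the package; rejecting negative, out-of-range, and duplicate indices is an assumption of this sketch, not something decided in this issue.

```python
def normalize_ndim(ndim, n_cols):
    """Return a sorted tuple of column indices.

    Hypothetical helper: accepts an int (use the first ndim columns)
    or a sequence of ints (use exactly these columns).
    """
    def is_int(value):
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)

    if is_int(ndim):
        if not 1 <= ndim <= n_cols:
            raise ValueError(f"ndim must be between 1 and {n_cols}")
        return tuple(range(ndim))

    cols = tuple(ndim)
    if not cols or not all(is_int(c) for c in cols):
        raise TypeError("ndim must be an int or a non-empty tuple of ints")
    if any(c < 0 or c >= n_cols for c in cols):
        raise ValueError(f"column indices must lie in [0, {n_cols})")
    if len(set(cols)) != len(cols):
        raise ValueError("column indices must be unique")
    return tuple(sorted(cols))


print(normalize_ndim(3, n_cols=5))          # (0, 1, 2)
print(normalize_ndim((4, 1, 2), n_cols=5))  # (1, 2, 4)
```

With a helper like this, every internal function would receive a tuple and could index with data[:, ndim] regardless of which form the caller passed.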

The advantage of this solution is that code that currently passes an int to ndim can continue to do so with no breakage.

Functions to change

Pretty much all of them, but most changes are minor (e.g. :ndim -> ndim in indexes).
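
As a rough idea of what such a minor change looks like, here is a hypothetical nearest-centroid assignment step. The function assign_labels is not taken from the package and only illustrates the :ndim -> ndim substitution, assuming ndim has already been converted to a tuple of column indices.

```python
import numpy as np

def assign_labels(data, centroids, ndim):
    """Hypothetical assignment step; ndim is a tuple of column indices."""
    points = data[:, ndim]            # previously: data[:, :ndim]
    # Squared distances between every point and every centroid,
    # shape (n_points, n_centroids).
    diffs = points[:, None, :] - centroids[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    return dists.argmin(axis=1)

# Made-up example: cluster on columns 0 and 2, ignoring column 1.
data = np.array([[0.0, 9.0, 0.0],
                 [1.0, 9.0, 1.0],
                 [10.0, 9.0, 10.0]])
centroids = np.array([[0.0, 0.0],
                      [10.0, 10.0]])
labels = assign_labels(data, centroids, ndim=(0, 2))
print(labels)   # [0 0 1]
```

The full rows of data stay untouched, so each entry keeps its unused columns alongside its label.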

Additional Notes

tjdwill commented 4 months ago

Coming back around to this, I think the proposed changes are viable, but I caution against implementing this change just because I can. I propose developing this in a side branch, and only releasing the patch if it is requested (or if I need it myself).