Problem

Currently, the ndim parameter lets a user specify how many dimensions to cluster on. The present implementation takes the first ndim columns of each data entry, i.e. it accesses the data as data[:, :ndim].

If users want to choose which columns to cluster on, however, they have to reorder the data themselves before calling the function.

Proposed Solution

Have ndim accept either an int or a tuple of ints. The former means "cluster on the first ndim dimensions," and the latter means "cluster using these column indices." NumPy's integer array indexing, which accepts a tuple of column indices, could then be leveraged:

ndim = (1, 2, 4)  # sorted internally during validation, because order matters in NumPy indexing
data[:, ndim]     # instead of data[:, :ndim]
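
As a quick illustration of the indexing described above, here is a minimal sketch using a made-up 4x5 array (the array contents are purely illustrative and not from the library):

```python
import numpy as np

# Hypothetical data: 4 samples with 5 columns each.
data = np.arange(20).reshape(4, 5)

# Current behavior: slice off the first three columns.
first_three = data[:, :3]

# Proposed behavior: pick arbitrary columns with a tuple of indices.
cols = (1, 2, 4)
selected = data[:, cols]

# An int converted to a range of indices selects the same values as the slice.
same_as_slice = data[:, tuple(range(3))]
```

One difference worth keeping in mind: a basic slice such as data[:, :3] returns a view, whereas indexing with a tuple of integers returns a copy, so in-place writes to the selection would no longer affect the original array.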

How would the ndim: int case be handled? By generating a tuple of indices from the number:

cluster(data, k=4, ndim=3)
# Internally, because ndim is an int:
ndim = tuple(range(ndim))  # (0, 1, 2), which selects the same columns as data[:, :ndim]
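
The int-or-tuple handling could be gathered into one small helper. The sketch below is only one possible shape for it; the name normalize_ndim and the exact error messages are assumptions, not part of the existing code:

```python
from typing import Tuple, Union


def normalize_ndim(ndim: Union[int, Tuple[int, ...]]) -> Tuple[int, ...]:
    """Return a sorted tuple of column indices (hypothetical helper)."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(ndim, bool):
        raise TypeError("ndim must be an int or a tuple of ints")
    if isinstance(ndim, int):
        if ndim < 1:
            raise ValueError("ndim must be at least 1")
        # ndim=3 becomes (0, 1, 2), matching the old data[:, :ndim] behavior.
        return tuple(range(ndim))
    # Sort so that the column order is deterministic.
    return tuple(sorted(ndim))
```

With a helper like this, the rest of the code would only ever see a tuple, so every access could use data[:, ndim] regardless of what the caller passed.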

This solution is backward compatible: code that currently passes an int to ndim can continue to do so with no breakage.

Functions to change

Pretty much all of them, but most are minor changes (e.g. :ndim -> ndim in indexes).

kmeans.base_funcs: Change accesses; modify _assign_clusters to take ndim as a parameter rather than generating it.
kmeans.clustering: Change the _assign_clusters call; adjust documentation.
kmeans.animate:
- Change _draw's derivation of x, y, and z.
- Adjust view_clustering.
kmeans_segmentation: No changes.

Additional Notes

A check will be needed to ensure that the length of the tuple passed to ndim matches the number of columns in the provided initial_means.
Assuming ndim is a tuple, another check should ensure that each index is unique (len(ndim) == len(set(ndim))).
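
The two checks above might look roughly like the following. This is a sketch under assumptions: the function name validate_ndim and its signature are hypothetical, and the bounds check against the data's column count is an extra safeguard that the notes above do not mention.

```python
from typing import Optional, Tuple

import numpy as np


def validate_ndim(
    ndim: Tuple[int, ...],
    data: np.ndarray,
    initial_means: Optional[np.ndarray] = None,
) -> None:
    """Raise ValueError if ndim is not a usable set of column indices."""
    # Each index must be unique.
    if len(ndim) != len(set(ndim)):
        raise ValueError("ndim contains duplicate column indices")
    # Assumption (not in the notes): indices must fall inside the data's columns.
    if any(not 0 <= i < data.shape[1] for i in ndim):
        raise ValueError("ndim contains an out-of-range column index")
    # The number of selected columns must match the columns of initial_means.
    if initial_means is not None and initial_means.shape[1] != len(ndim):
        raise ValueError("len(ndim) must equal the number of columns in initial_means")
```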

tjdwill / kmeans

Feature suggestion: Provide ability to specify which columns you cluster with. #2
