joshlk / k-means-constrained

K-Means clustering - constrained with minimum and maximum cluster size. Documentation: https://joshlk.github.io/k-means-constrained
https://github.com/joshlk/k-means-constrained
BSD 3-Clause "New" or "Revised" License

Usage for size_max < sample size #25

Closed: ajdajd closed this issue 2 years ago

ajdajd commented 2 years ago

Describe the bug

Not a bug but a question. Fitting KMeansConstrained with X.shape[0] < size_max raises ValueError: size_min and size_max must be a positive number smaller than the number of data points or None, which I understand. However, in my case, this constraint could be violated without any consequence to the output. See the MWE below.

Minimum working example

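import numpy as np
from k_means_constrained import KMeansConstrained

X = np.array([[0, 0, 0], [0, 0, 0]])  # only 2 samples
clst = KMeansConstrained(
    n_clusters=1,
    size_min=1,
    size_max=3,  # larger than the number of samples
)
clst.fit(X)  # raises the ValueError quoted above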

In this case, it should fit a single cluster. It's not a big deal, as I could wrap the call in a try/except, or pre-check the input array shape against size_max (similar to the check in the source code) to avoid the ValueError. I am just wondering if this is an edge case.

Some context: In the analysis I am trying to run, I am running

import math

# Enough clusters that each one holds at most 3 samples
n = math.ceil(X.shape[0] / 3)
clst = KMeansConstrained(n_clusters=n, size_min=1, size_max=3)
clst.fit(X)

over different Xs, most of which have >100 samples, apart from a few odd ones that have 1-2 samples in them. Again, I could work my way around it, just wondering about the size_max < sample size check.
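For reference, the pre-check mentioned above could be sketched roughly as follows. This is only an illustration: constrained_kmeans_params is a hypothetical helper, not part of k-means-constrained, and it assumes the library's check accepts size_max equal to the number of samples. It only computes the constructor arguments from a sample count and does not run any clustering.

```python
import math


def constrained_kmeans_params(n_samples, group_size=3, size_min=1):
    """Build n_clusters / size_min / size_max for a given sample count.

    Hypothetical helper (not part of k-means-constrained): it caps
    size_min and size_max at n_samples so that very small inputs do
    not trip the size check.
    """
    if n_samples < 1:
        raise ValueError("need at least one sample")
    return {
        # Same cluster count as in the snippet above: ceil(n_samples / 3)
        "n_clusters": math.ceil(n_samples / group_size),
        "size_min": min(size_min, n_samples),
        "size_max": min(group_size, n_samples),
    }


# Sample counts only; no clustering is run here.
for n_samples in (1, 2, 150):
    print(n_samples, constrained_kmeans_params(n_samples))
```

The returned dictionary could then be unpacked into the constructor, e.g. KMeansConstrained(**constrained_kmeans_params(X.shape[0])), before calling fit(X).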

Thanks for the great library!

joshlk commented 2 years ago

Hey, I see what you mean by it not changing the output but I consider the arguments to be wrong. I think it's better to notify the user than silently ignore it.

You can easily work around it by doing:

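import numpy as np
from k_means_constrained import KMeansConstrained

X = np.array([[0, 0, 0], [0, 0, 0]])
clst = KMeansConstrained(
    n_clusters=1,
    size_min=1,
    size_max=min(3, len(X)),  # cap size_max at the number of samples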
)
clst.fit(X)

ajdajd commented 2 years ago

Thanks for the guidance!