joshlk / k-means-constrained

K-Means clustering - constrained with minimum and maximum cluster size. Documentation: https://joshlk.github.io/k-means-constrained
https://github.com/joshlk/k-means-constrained
BSD 3-Clause "New" or "Revised" License

Updating to have predict_untrained function #43

Closed AdityaSavara closed 11 months ago

AdityaSavara commented 1 year ago

The existing code does not allow training on a large set of points and then predicting which clusters new points would be assigned to. The new function predicts which cluster center each new point belongs to, for points that are outside of the original training data.
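In rough outline, the new function assigns each new point to the nearest existing centroid without re-running the constrained fitting step. A simplified sketch of the idea (illustrative only; the actual code in this PR may differ):

    import numpy as np

    def predict_untrained(clf, X_new):
        # Illustrative sketch: assign each new point to the nearest centroid
        # already stored on a fitted KMeansConstrained object, without
        # re-running the constrained assignment step.
        X_new = np.asarray(X_new)
        # Pairwise Euclidean distances between new points and fitted centroids
        distances = np.linalg.norm(
            X_new[:, np.newaxis, :] - clf.cluster_centers_[np.newaxis, :, :],
            axis=2)
        # Label of each new point = index of its closest centroid
        return distances.argmin(axis=1)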

If I'm not mistaken, scikit-learn does this automatically with the predict function. However, I can see why your code does not do that, so an additional function seems reasonable. Here, "predict_untrained" is the name chosen, since I think that is reasonably clear.

Edit: I decided I liked predict_untrained and changed the function name to that on my end.

joshlk commented 1 year ago

Hi, thanks for your interest in k-means-constrained! 😁

Can you please elaborate on what you're trying to do, as the current predict method does allow you to assign unseen data to the clusters?

AdityaSavara commented 1 year ago

Thanks for checking what I'm trying to do! Unless I am mistaken, with KMeansConstrained there is no function that will allow prediction without refitting. Your current example still does the refitting (I think). For my application (using a conventional personal computer) that fitting can take >1 hour, so it's useful to have a prediction function that is separate from the fitting, after the initial training. From the source code, I did not see anything like that in KMeansConstrained. I think that in the regular k-means from sklearn, the predict function works similarly to what I made, though perhaps more efficiently. I made a separate function because I didn't want to interfere with your code.

joshlk commented 1 year ago

Hi,

The KMeansConstrained.predict method does just that. It assigns the new data points to the old centroids (cluster centres) - it doesn't re-fit the centroids. Here is an example:

from k_means_constrained import KMeansConstrained
import numpy as np

# Fit clf to `X1` data to create centroids
X1 = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
clf = KMeansConstrained(
    n_clusters=2,
    size_min=2,
    size_max=5,
    random_state=0
)
clf.fit_predict(X1)
centroids = clf.cluster_centers_[:]

# Assign new unseen data (X2) to centroids fitted with X1
X2 = np.array([[6, 7], [8, 9], [10, 11], [12, 13], [14, 15], [16, 17]])
clf.predict(X2)

# Assert that centroids (cluster centres) haven't changed
assert (clf.cluster_centers_ == centroids).all()

So by using predict you don't refit the model. Is this what you mean by refitting?

(Sorry my example in the previous post wasn't helpful so I've removed it)

AdityaSavara commented 1 year ago

In my case, I was "fitting" on 50,000 points, with a min and max of 200 and 500, for example. Then I tried to predict a couple of points.

When I tried my example, I found a few things:

(1) These lines of code gave errors:

    # Check size min and max
    if not ((size_min >= 0) and (size_min <= n_samples)
            and (size_max >= 0) and (size_max <= n_samples)):
        raise ValueError("size_min and size_max must be a positive number smaller "
                         "than the number of data points or `None`")
    if size_max < size_min:
        raise ValueError("size_max must be larger than size_min")
    if size_min * n_clusters > n_samples:
        raise ValueError("The product of size_min and n_clusters cannot exceed the number of samples (X)")

My recollection is that I tried both leaving the min and max as None and passing them explicitly, and that neither worked, but I could be remembering wrong. So I commented out those lines of code.

(2) Then this line of code gave me trouble as well:

    labels, inertia = \
        _labels_constrained(X, self.cluster_centers_, size_min, size_max,
                            distances=distances)

That's why I tried to make an additional function. I'm sorry that I don't have a minimal example to provide; my example is in the middle of some other code. I suppose it wouldn't be very hard to make a minimal example with randomly populated arrays, but unfortunately I'm very behind and have moved to the next stage of what I'm working on. I might be open to making a minimal example after I'm past my deadline(s). It is also possible that I made some kind of user error.
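If it helps in the meantime, I suspect a randomly populated example along these lines would reproduce the problem (this is only a guess, assuming predict re-applies the same size_min/size_max checks to the new points):

    from k_means_constrained import KMeansConstrained
    import numpy as np

    # Fit on many points with a large size_min (guessed numbers, not my real data)
    X_train = np.random.rand(2000, 2)
    clf = KMeansConstrained(n_clusters=4, size_min=200, size_max=500, random_state=0)
    clf.fit(X_train)

    # Predict only a couple of new points: if the size_min/size_max checks are
    # re-applied, size_min * n_clusters (800) exceeds the 2 new samples and the
    # ValueError quoted above is raised.
    X_new = np.random.rand(2, 2)
    clf.predict(X_new)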

AdityaSavara commented 1 year ago

Ok, I have a sort-of minimal example, but I'm using a dill-pickled version of the cluster object because it took 5 hours to train.

To run this: if you pip install PEUQSE[complete] you will get the dill pickle dependencies. Alternatively, you could probably install dill and change the code in clustering.py to unpickle the object with dill directly, without using the PEUQSE function, but that would be more work than just installing PEUQSE[complete].
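(For reference, loading the attached object directly with dill would look roughly like this; the file name below is just illustrative:)

    import dill

    # Load the pickled cluster object without going through PEUQSE
    with open("cluster_object.dill", "rb") as f:
        clf = dill.load(f)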

Attached are the dill file and a run file, along with my modified k_means_constrained.py file. For my own records... I have made this in a directory called 14b_minimal on my computer. minimal_example.zip

If you run clustering.py right now, the predict_untrained function that is being used will work. If you change it to use the regular predict (near the bottom of the clustering.py file), you will get an error. I have shown this by putting a call to predict at the bottom. I was unable to edit the predict function in a way that avoids the error. I thought about introducing some complicated logic into predict, but decided it was better to just make a new function (at least as an intermediate-term solution).

I should clarify that before running this, you need to add my custom function into your k_means_constrained.py file, otherwise you will obviously get an error that the object doesn't have any function named predict_untrained!

joshlk commented 1 year ago

Hi @AdityaSavara,

Sorry, but for me to be able to diagnose the issue you will need to make a minimal working example. This should only be a few lines of code with a very small number of data points.

Thanks, Josh

AdityaSavara commented 1 year ago

No problem! It will just take me some time to get to it since I am overburdened at the moment, but I will be happy to do so once I get past some deadlines! (Maybe a month as a best guess)

joshlk commented 11 months ago

I am closing this due to inactivity. Feel free to reopen it if you wish.