opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g., k-means or linear regression, to help developers build ML-related features within OpenSearch.

[SUGGESTED FEATURE] #358

sudiptoguha commented 2 years ago

What? Disentangle the clustering function to provide a summary of the centroids (in any suitable format) and the approximate (possibly fractional) number of points represented by each centroid.

Why? The above output decouples the act of summarization (e.g., there were 5 well separated clusters in the data today; yesterday there were 7) from the act of assigning every data point to the nearest centroid. The latter (assignment to the nearest centroid) is not useful in many scenarios -- for example, when clustering is used to denoise or reduce data.

Further, the decoupling allows for "soft clustering"; see https://en.wikipedia.org/wiki/Expectation–maximization_algorithm, and the specific shout-out there to the online textbook "Information Theory, Inference, and Learning Algorithms": http://www.inference.phy.cam.ac.uk/mackay/itila/

Please check Chapter 20, "An Example Inference Task: Clustering".

In fact, a decoupled setup covers both uses, summarization and labeling. Most modern clustering algorithms (circa 1990+) would bypass the labeling step to be more scalable. This is especially true of streaming algorithms, coresets, etc. A sketch of what the decoupled output could look like follows.
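To make the suggestion concrete, here is a minimal NumPy sketch of such a summary: centroids plus the fractional number of points each one represents, computed via an EM-style E-step. The helper name soft_cluster_summary and the fixed-bandwidth Gaussian responsibilities are illustrative assumptions, not an existing ml-commons API.

import numpy as np

def soft_cluster_summary(X, centroids, sigma=1.0):
    """Hypothetical helper (illustration only, not an ml-commons API):
    summarize data as centroids plus the fractional number of points
    each centroid represents, without labeling individual points."""
    # Squared distances from every point to every centroid: shape (n, k)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Gaussian responsibilities (E-step of EM with fixed bandwidth sigma)
    resp = np.exp(-d2 / (2.0 * sigma ** 2))
    resp /= resp.sum(axis=1, keepdims=True)  # each row sums to 1
    # Fractional point counts per centroid: column sums of responsibilities
    counts = resp.sum(axis=0)
    return centroids, counts

X = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 0.0],
              [10.0, 2.0], [10.0, 4.0], [10.0, 0.0]])
centroids = np.array([[1.0, 2.0], [10.0, 2.0]])
centers, counts = soft_cluster_summary(X, centroids)
print(counts)  # approximately [3.0, 3.0]; fractional in general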

ylwu-amzn commented 2 years ago

Hi @sudiptoguha, is this the same as https://github.com/opensearch-project/ml-commons/issues/356?

sudiptoguha commented 2 years ago

No. I am asking you to change the API for K-Means (and any future clustering algorithm, were that to happen). Clustering is not an example of train and predict -- in particular, clustering is unsupervised, and "training" is a complete misnomer in an unsupervised context. Now, it seems you can "adapt" classification output to express clustering, but that does not make clustering a supervised task.

I again recommend the online textbook "Information Theory, Inference, and Learning Algorithms": http://www.inference.phy.cam.ac.uk/mackay/itila/

Please check Chapter 20, "An Example Inference Task: Clustering".

ylwu-amzn commented 2 years ago

Thanks. Agreed that "training" may not be necessary/proper for unsupervised learning. I checked the sklearn KMeans doc, https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html; it seems they provide a similar API with fit and predict. See their example:

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)

They also provide a fit_predict API; we have something similar called train_predict.
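For comparison, the summary-style output suggested above can be derived from a fitted sklearn estimator; the following is a rough sketch along those lines (with hard counts, since sklearn's KMeans does hard assignment), not a proposal for the final ml-commons API shape.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# Summary view: centroids plus the number of points each represents,
# without returning per-point labels to the caller.
centroids = kmeans.cluster_centers_   # shape (2, 2)
counts = np.bincount(kmeans.labels_)  # e.g. [3, 3]
print(centroids, counts)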

sudiptoguha commented 2 years ago

I see. If Scikit-learn APIs are the guiding light, then has there been any evaluation of whether Scikit-learn itself would be sufficient? Or parts of it? Scikit-learn does have KMeans clustering; reusing it would save significant reimplementation and discussions such as "Scikit-learn does it this way".