joshlk / k-means-constrained

K-Means clustering - constrained with minimum and maximum cluster size. Documentation: https://joshlk.github.io/k-means-constrained
https://github.com/joshlk/k-means-constrained
BSD 3-Clause "New" or "Revised" License

Constrained K-Means not implemented for sparse matrices #16

Closed · ericjardimx closed this 3 years ago

ericjardimx commented 3 years ago

I'm trying to apply constrained k-means to my data and I get this error: "NotImplementedError: Not implemented for sparse X".

Originally I have a dataframe with 132034 rows of titles. I convert them to a list and then apply a TF-IDF fit_transform to it.

This produces a 132034x17693 sparse matrix of type <class 'numpy.float64'> with 694509 stored elements in Compressed Sparse Row format.

Then, I try to apply the model:

    true_k = 25
    smin = 5300
    smax = 13200
    model = KMeansConstrained(n_clusters=true_k, size_min=smin, size_max=smax,
                              random_state=0, init='k-means++', n_init=10, max_iter=1000)

true_k was chosen via the elbow method using ordinary k-means; smin and smax are both based on hypotheses about the sample.

But I get this error and can't get past it. There's no problem at all running the usual k-means, and I have 128 GB of RAM, so it's not a lack of processing power either.
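
For reference, a minimal sketch that reproduces the reported error with toy data (the variable names and the random matrix are illustrative, assuming scipy and k-means-constrained are installed):

```python
import scipy.sparse as sp
from k_means_constrained import KMeansConstrained

X = sp.random(100, 20, density=0.05, format='csr')  # toy CSR sparse matrix
model = KMeansConstrained(n_clusters=4, size_min=10, size_max=50, random_state=0)
model.fit(X)  # raises NotImplementedError: Not implemented for sparse X
```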

joshlk commented 3 years ago

The error is what it says on the tin: you can’t use a sparse input.

You need to convert the sparse input to a normal (dense) format. Usually you can do this with X.todense() if it's a scipy sparse array.
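
For example, a minimal sketch of that conversion (the TF-IDF pipeline and names here are illustrative, not from the thread; note that a 132034x17693 float64 dense array takes roughly 18.7 GB, which fits in 128 GB of RAM):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from k_means_constrained import KMeansConstrained

titles = ["red apple pie", "green apple tart", "blue summer sky", "grey cloudy sky"]
X_sparse = TfidfVectorizer().fit_transform(titles)  # scipy CSR matrix

# .toarray() gives a plain ndarray; X_sparse.todense() also works but
# returns a numpy matrix object.
X_dense = X_sparse.toarray()

model = KMeansConstrained(n_clusters=2, size_min=1, size_max=3, random_state=0)
labels = model.fit_predict(X_dense)
```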

ericjardimx commented 3 years ago

> The error is what it says on the tin: you can't use a sparse input.
>
> You need to convert the sparse input to a normal (dense) format. Usually you can do this with X.todense() if it's a scipy sparse array.

OK, done. It took 16 hours with 100 iterations. The clusters came out evenly distributed: almost the same count per cluster, and nearly all at the assigned minimum.

The native scikit-learn k-means gave me very unevenly distributed clusters. I know that plain k-means should give roughly similar counts per cluster, but compared to that, shouldn't the constrained version land somewhere in between: not so unevenly distributed, but also not so uniform?

sklearn.cluster k-means, 100 iterations:

| Cluster | Size |
| --- | --- |
| 1 | 1764 |
| 2 | 872 |
| 3 | 2019 |
| 4 | 5183 |
| 5 | 1956 |
| 6 | 1388 |
| 7 | 1588 |
| 8 | 2241 |
| 9 | 3476 |
| 10 | 2017 |
| 11 | 869 |
| 12 | 3238 |
| 13 | 3637 |
| 14 | 2970 |
| 15 | 1362 |
| 16 | 4002 |
| 17 | 1894 |
| 18 | 5300 |
| 19 | 2672 |
| 20 | 3289 |
| 21 | 2353 |
| 22 | 68407 |
| 23 | 2752 |
| 24 | 1349 |
| 25 | 5436 |

k_means_constrained KMeansConstrained, 100 iterations:

min_clus = 5280, max_clus = 13202

| Cluster | Size |
| --- | --- |
| 1 | 5280 |
| 2 | 5280 |
| 3 | 5280 |
| 4 | 5280 |
| 5 | 5280 |
| 6 | 5304 |
| 7 | 5280 |
| 8 | 5280 |
| 9 | 5280 |
| 10 | 5280 |
| 11 | 5280 |
| 12 | 5280 |
| 13 | 5280 |
| 14 | 5280 |
| 15 | 5280 |
| 16 | 5280 |
| 17 | 5280 |
| 18 | 5280 |
| 19 | 5280 |
| 20 | 5280 |
| 21 | 5280 |
| 22 | 5280 |
| 23 | 5280 |
| 24 | 5280 |
| 25 | 5280 |
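
Tables like the ones above can be tabulated from the fitted labels, e.g. with a sketch like this (the `labels` array here is a toy stand-in; in the thread it would come from model.fit_predict on the dense matrix):

```python
import numpy as np

# Count how many points landed in each cluster.
labels = np.array([0, 0, 1, 2, 1, 0, 2, 2])  # toy labels from fit_predict
sizes = np.bincount(labels)
for cluster_id, size in enumerate(sizes, start=1):
    print(f"{cluster_id} | {size}")
```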

joshlk commented 3 years ago

The cluster distribution depends on the data distribution; there are no guarantees for either normal k-means or constrained k-means. The input data drives the output cluster distribution.

In practice, from experience, k-means cluster sizes usually follow a power-law distribution (which your data above appears to have), and a constrained k-means distribution will be more uniform.

As your k-means-constrained output cluster distribution is completely uniform, this indicates you need a lower minimum cluster size, and possibly more clusters, to better represent your data. But this decision depends on your use case.
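
As an illustration of that suggestion (the numbers below are hypothetical, not a recommendation from the thread): keep the size_max cap but set size_min well below the uniform level of roughly 132034 / 25 ≈ 5281, so cluster sizes can follow the data.

```python
from k_means_constrained import KMeansConstrained

# Hypothetical loosened constraints: size_min far below the uniform level
# lets cluster sizes vary with the data, while size_max still caps the
# dominant cluster.
model = KMeansConstrained(
    n_clusters=25,    # or more, re-checked with the elbow method
    size_min=1000,    # illustrative; much lower than ~5281
    size_max=13200,
    random_state=0,
)
```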