Resource intensity - Githubissues

carl-offerfit commented 1 year ago

Hi,

It seems quite resource intensive compared to regular k-means. I'm working on a dataset with 23,000 examples and 50 features right now and it took almost 15 minutes to find 10 clusters; I set the minimum to be 0.5 the mean size cluster size (1150) and the maximum to be 2x the mean cluster size (4600). Does that seem reasonable? (possibly something wrong with my setup?) Those don't seem like very unreasonable constraints. But I would like to go larger, maybe 1M examples and 150 features. Maybe up to 20 or 50 clusters. Do you have any advice on this? How many iterations is reasonable to use? How big a dataset is reasonable to use if we would like it to complete in less than 15 minutes? Do you have experience to share as far as finding cluster centers from a sub-sample instead of the full sample? Or any papers or other types of references you recommend on these questions.

Thanks, any advice helps! Carl

joshlk commented 1 year ago

Hi Carl,

Nice to meet you 👋 .

The k-means-constrained algorithm is more complex than vanilla k-means and therefore will take longer to execute and has worse scaling characteristics.

Essentially the runtime scales as $\mathcal{O}(n^4\log(n)))$ whereby $n$ is the number of data points and the number of clusters are assumed to be of a similar order. I have added further details and a benchmark to the readme.

23,000 examples is a lot for this algorithm, so I am not surprised.

I recommend partitioning the data before clustering to reduce the number of data points and clusters. For example, if your data is geographical you can partition the data into regions; or if you are dealing with people you can partition the data by age and gender.

I hope that helps, Josh

joshlk commented 1 year ago

I am closing this thread due to inactivity. Feel free to re-open if you have further questions

joshlk / k-means-constrained

Resource intensity #47