nicodv / kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
MIT License

Enhance random initialization for K-means part of K-prototypes #116

Open regorsmitz opened 5 years ago

regorsmitz commented 5 years ago

Thanks @nicodv for your response to my previous question about failed KPrototypes initialization, and for building this library, which I have found very helpful!

Now I see that your k-means implementation initializes with points sampled from a normal distribution. Sorry for my previous confusion. That said, I don't think the current behavior is appropriate for all use cases. In my case, it is important that the initialization always succeeds, because I'd ideally like to run this job as part of a production pipeline. Random initialization of k-means is standard, and if n_init is set high enough, it should give reasonably good results, depending on the dataset.

I could just select a random set of points from my dataset and pass them explicitly as the k-means initialization, but (correct me if I'm wrong) this approach does not seem to allow taking advantage of n_init > 1, so the single random initialization is much more likely to end in a suboptimal clustering.

Thanks for reading, and sorry for filling this repo with issues. If you want me to put in a PR for this change, I can give it a shot: adding something like init='all-random' to KPrototypes only, which would randomly initialize the k-means component n_init times.

nicodv commented 5 years ago

I've followed the papers by Huang (https://github.com/nicodv/kmodes#huang98), which sample from a normal distribution.

Feel free to make a PR for this. It makes sense to open up the initialization of the k-means part of k-prototypes to enhancements. We'd have init_num and init_cat arguments to k-prototypes, I'd imagine.

In the meantime, you can do the sampling yourself and re-run k-prototypes each time with the chosen points as the initialization points. You're right, it's not supported out of the box.
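
A minimal sketch of that workaround follows. It is not part of kmodes: the helper names (sample_init_rows, best_of_restarts, make_kprototypes_fit) are hypothetical. It assumes that KPrototypes accepts init as a two-element list [numerical_centroids, categorical_centroids], and that the categorical columns are already integer-coded. It draws n_clusters distinct rows from the data for each restart, fits once with those rows as the initial centroids, and keeps the lowest-cost model.

```python
import random


def sample_init_rows(n_rows, n_clusters, rng):
    """Pick n_clusters distinct row indices to use as initial centroids."""
    if n_clusters > n_rows:
        raise ValueError("n_clusters cannot exceed the number of rows")
    return rng.sample(range(n_rows), n_clusters)


def best_of_restarts(n_rows, n_clusters, n_init, fit_once, seed=None):
    """Call fit_once(indices) n_init times and keep the lowest-cost result.

    fit_once must return a (cost, model) pair.
    """
    rng = random.Random(seed)
    best_cost, best_model = None, None
    for _ in range(n_init):
        idx = sample_init_rows(n_rows, n_clusters, rng)
        cost, model = fit_once(idx)
        if best_cost is None or cost < best_cost:
            best_cost, best_model = cost, model
    return best_cost, best_model


def make_kprototypes_fit(X, categorical, n_clusters):
    """Build a fit_once callback around KPrototypes (imported lazily)."""
    import numpy as np
    from kmodes.kprototypes import KPrototypes

    X = np.asarray(X, dtype=object)
    numerical = [j for j in range(X.shape[1]) if j not in categorical]

    def fit_once(idx):
        rows = X[idx]
        # Assumption: init takes [numerical centroids, categorical centroids],
        # and the categorical values are already integer codes.
        init = [rows[:, numerical].astype(float), rows[:, categorical]]
        model = KPrototypes(n_clusters=n_clusters, init=init, n_init=1)
        model.fit(X, categorical=categorical)
        return model.cost_, model

    return fit_once
```

With a dataset X whose categorical column indices are cat_cols, this could be called as best_of_restarts(len(X), k, 10, make_kprototypes_fit(X, cat_cols, k)). Rows that happen to be duplicates can still produce identical initial centroids, so a failed fit may need to be caught and skipped inside fit_once.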