nicodv / kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
MIT License

Reloading a pretrained model #150

Closed: arainboldt closed this issue 3 years ago

arainboldt commented 3 years ago

Hi,

Great work on this library. I really appreciate the implementation. I'm having trouble, however, re-instantiating a pretrained KPrototypes model from saved centroids.

When I pass the centroids tuple to KPrototypes via the 'init' argument and then call fit, the model goes through the normal fitting procedure even though the centroids are already fixed. Additionally, the learned centroids, although close to those provided, are not identical, and neither are the predictions on the same data.

Note that I'm using the same random_state value and n_clusters. The pretrained and reloaded models are initialized identically except for the 'init' argument, which is a list of np.arrays in the latter case.

I would expect that calling 'fit' on a model where the centroids are provided via 'init' would simply need to generate '_enc_cluster_centroids' and '_enc_map' to allow for prediction.

Consequently, I feel that I'm missing something. What am I doing wrong?

Thanks in advance for the help,

Andrew

nicodv commented 3 years ago

The recommended way to save and reload models is Python's pickle (or your preferred binary format), as shown in the tests here: https://github.com/nicodv/kmodes/blob/master/kmodes/tests/test_kprototypes.py#L39
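A minimal sketch of the pickle round trip, for illustration only. The helper names save_model, load_model and demo, the file name kproto.pkl, and the toy data are assumptions made for this example and are not part of kmodes. The demo function assumes numpy and kmodes are installed and that the third column is the categorical one.

```python
import pickle
from pathlib import Path


def save_model(model, path):
    """Serialize a fitted model (any picklable object) to disk."""
    with Path(path).open("wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a model previously written by save_model."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def demo():
    """Fit KPrototypes, round-trip it through pickle, compare predictions."""
    # Imported here so the helpers above work without kmodes installed.
    import numpy as np
    from kmodes.kprototypes import KPrototypes

    # Hypothetical toy data: columns 0-1 numeric, column 2 categorical.
    X = np.array(
        [
            [1.0, 2.0, "a"],
            [1.2, 1.9, "a"],
            [0.9, 2.1, "a"],
            [8.0, 9.0, "b"],
            [8.2, 9.1, "b"],
            [7.9, 8.8, "b"],
        ],
        dtype=object,
    )

    model = KPrototypes(n_clusters=2, init="Cao", n_init=1, random_state=0)
    model.fit(X, categorical=[2])

    save_model(model, "kproto.pkl")
    restored = load_model("kproto.pkl")

    # The restored object is used as-is; no call to fit is needed.
    before = model.predict(X, categorical=[2])
    after = restored.predict(X, categorical=[2])
    assert (before == after).all()
```

Because the whole fitted object is restored, including its internal encoding state, predict can be called right away. Only unpickle files you trust, since loading a pickle can execute arbitrary code.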

The 'init' argument is meant for retraining a model from a good starting point, which can be useful, for example, when retraining on new but similar data.

arainboldt commented 3 years ago

Thanks for your quick response and for clarifying.