nicodv / kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
MIT License

init centroids with ndarray #163

Closed vandommer closed 2 years ago

vandommer commented 2 years ago

Expected Behavior

Be able to initialize centroids with an ndarray of shape (n_centroids, n_features) that includes categorical data.

Actual Behavior

Categorical data in the centroids array raises an error: ValueError: invalid literal for int() with base 10: ''

Steps to Reproduce the Problem

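import pandas as pd
from kmodes.kmodes import KModes

att1 = [1, 2, 3, 1, 1, 2, 3, 1]
att2 = ['A', 'B', 'B', 'A', 'C', 'C', 'C', 'C']
data1 = list(zip(att1, att2))
test = pd.DataFrame(data=data1, columns=['A1', 'A2'])

# First fit with random initialization
kmodeTest = KModes(n_clusters=3, init='random', n_init=3, verbose=1).fit(test)

# Decoded centroids (original category values)
cent = kmodeTest.cluster_centroids_

# Reusing the decoded centroids as init raises the ValueError above
kmodeTest2 = KModes(n_clusters=3, init=cent, n_init=1, verbose=0).fit(test)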

Specifications

nicodv commented 2 years ago

When you provide initial centroids with init = ..., you have to provide the encoded centroids. Try this instead:

kmodeTest2 = KModes(n_clusters=3, init = kmodeTest._enc_cluster_centroids, n_init = 1, verbose=0).fit(test)

vandommer commented 2 years ago

Thank you so much, Nico. That works perfectly. It would be great to put this in the documentation.

Now the next question: how can I encode centroid arrays that are not encoded? I saved only the non-encoded centroids before my kernel restarted, and I lost the model. How can I get it back? The array holds a mix of floats and strings. Here is what it looks like (7 clusters, 8 attributes):

array([['FMAN', 'ASS', '16', '5937', '48.25066176470591', '68.93750000000003', '351', '153', '0'],
       ['FMAN', 'COMM', '1527', '0', '20.047013888888927', '19.100000000000023', '311', '94', '4'],
       ['CIB2', 'COMM', '26859', '27', '55.91500000000002', '31.545000000000016', '213', '176', '2'],
       ['BDM2', 'GEN', '11', '13', '11.268750000000011', '15.640000000000015', '28', '30', '5'],
       ['CMHC', 'GEN', '26153', '3129', '32.59417958656334', '24.906091954023026', '169', '146', '6'],
       ['TD', 'BAN', '105049', '45212', '12.746125000000035', '-7.8288181818181215', '201', '158', '3'],
       ['CIB2', 'FIN', '14565', '5596', '51.026904761904774', '51.026904761904774', '113', '78', '1']],
      dtype='<U32')

nicodv commented 2 years ago

If you have the saved model, it has the mapping available in the ._enc_map attribute.

Otherwise, you can re-encode using: https://github.com/nicodv/kmodes/blob/master/kmodes/util/__init__.py#L26 That process should be deterministic, but you should be careful nevertheless.
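As a rough illustration of what re-encoding involves, here is a minimal pure-Python sketch. It assumes the encoding gives each column its own mapping from distinct values to integer codes, assigned in sorted order. build_enc_map and encode_rows are hypothetical helpers written for this example, not kmodes functions, and the saved centroids are made-up values for the small data set in the reproduction steps. Everything is cast to str because the saved array above is all strings (dtype '<U32'). That cast changes the sort order of numeric columns, so the resulting codes only match the original model's if the map is rebuilt from the same training data in the same form.

```python
# Sketch only: rebuild integer-encoded centroids from saved, decoded ones.
# build_enc_map and encode_rows are hypothetical helpers, not kmodes API.

def build_enc_map(rows):
    """Return one dict per column mapping each distinct value to an int code."""
    n_cols = len(rows[0])
    enc_map = []
    for col in range(n_cols):
        # Cast to str so lookups agree with centroids that were saved as strings.
        values = sorted({str(row[col]) for row in rows})
        enc_map.append({val: code for code, val in enumerate(values)})
    return enc_map


def encode_rows(rows, enc_map):
    """Encode rows with an existing mapping; values not in the map become -1."""
    return [
        [enc_map[col].get(str(val), -1) for col, val in enumerate(row)]
        for row in rows
    ]


# Training data from the reproduction steps.
data = list(zip([1, 2, 3, 1, 1, 2, 3, 1],
                ['A', 'B', 'B', 'A', 'C', 'C', 'C', 'C']))

# Illustrative saved centroids, stored as strings.
saved_centroids = [['1', 'C'], ['2', 'B'], ['3', 'C']]

enc_map = build_enc_map(data)
encoded_centroids = encode_rows(saved_centroids, enc_map)
print(encoded_centroids)  # [[0, 2], [1, 1], [2, 2]]
```

A -1 in the result means a centroid value was not found in the mapping, which usually points to a type or formatting mismatch between the saved centroids and the training data.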

vandommer commented 2 years ago

Thank you so much. I will play with it. It looks like it will perfectly solve my issue. Thanks again

vandommer commented 2 years ago

It works. I was able to recreate the centroids, and the array was accepted. Hurray! However, when I fit the model, it still runs iterations and makes some changes. Not many, but the final model is not identical to the one I am trying to replicate. Is there a way to prevent any iterations from happening? (n_init = 0 doesn't work.)
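One possible workaround, offered as a sketch and not as the library's own method, is to skip fitting and assign each row to its nearest saved centroid directly. k-modes measures distance with simple matching dissimilarity (the number of attributes on which two rows differ), so the assignment step can be written in a few lines of plain Python. matching_dissim and assign_labels are hypothetical names for this example, the centroids are illustrative values, and ties here go to the first centroid, which may differ from what kmodes does.

```python
# Sketch only: label rows against fixed centroids, with no refitting.
# matching_dissim and assign_labels are hypothetical helpers, not kmodes API.

def matching_dissim(row, centroid):
    """Count the attributes on which a row and a centroid disagree."""
    return sum(str(a) != str(b) for a, b in zip(row, centroid))


def assign_labels(rows, centroids):
    """Assign each row to the centroid with the fewest mismatches."""
    labels = []
    for row in rows:
        dists = [matching_dissim(row, c) for c in centroids]
        # Ties are broken in favour of the first centroid.
        labels.append(dists.index(min(dists)))
    return labels


# Small categorical data set and illustrative saved centroids.
rows = list(zip([1, 2, 3, 1, 1, 2, 3, 1],
                ['A', 'B', 'B', 'A', 'C', 'C', 'C', 'C']))
centroids = [['1', 'C'], ['2', 'B'], ['3', 'C']]

labels = assign_labels(rows, centroids)
print(labels)
```

Because the centroids are never updated, the labels depend only on the saved centroids and the data.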