nicodv / kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
MIT License

IndexError when manually decoding categorical part of centroids #158

Closed Eugenia77 closed 2 years ago

Eugenia77 commented 3 years ago

Hello,

I am trying to run k-prototypes on a mix of ordinal, nominal and continuous data (there are very few continuous variables), using "chi-square" or "ng_dissim" for the categorical dissimilarity (number of clusters = 2), and then calculate the silhouette score separately for the kMeans and kModes "parts" of the algorithm. It all seems to work fine until the decoding step, where decode_centroids() fails with the following message:

IndexError Traceback (most recent call last)

in
      8 kmodesCentroids = centroidsList[1]
      9 print(kmodesCentroids)
---> 10 kmodesCentroidsDecoded = decode_centroids(kmodesCentroids, kmodes_enc_map)
     11 kmeansCentroids = centroidsList[0]
     12 #kmeansCentroidsReverseTransformed = BoxCoxReverseTransform(kmeansCentroids, bcLambdaList)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\kmodes\util\__init__.py in decode_centroids(encoded, mapping)
     57     """
     58     decoded = []
---> 59     for ii in range(encoded.shape[1]):
     60         # Invert the mapping so that we can decode.
     61         inv_mapping = {v: k for k, v in mapping[ii].items()}

IndexError: tuple index out of range

The kmodesCentroids that the function is trying to process looks like this:

[4.07539289e+02 1.96867152e+00 1.15256036e-01 4.14546146e+00
 6.52170321e+00 3.29984704e+00 1.20504957e+00 2.70561578e-01
 9.68008237e-01 2.24801237e+00 4.06081078e-01 1.58366314e+00
 1.22267842e+00 8.21319560e-01 1.40041311e+00 1.34779471e+01
 9.22667433e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]

and the encoding is as follows:

[{0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3, 4.0: 4, 5.0: 5},
 {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3},
 {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3, 4.0: 4, 5.0: 5, 6.0: 6},
 {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3, 4.0: 4, 5.0: 5},
 {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3, 4.0: 4, 5.0: 5},
 {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3, 4.0: 4, 5.0: 5},
 {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3, 4.0: 4, 5.0: 5}]

I hope the way I have described the problem makes sense and that you may be able to assist. Any feedback, suggestions, or comments will be greatly appreciated!

Thank you in advance,
Eugenia

nicodv commented 3 years ago

So kmodesCentroids is supposed to be the categorical part of the centroids, and you're getting it from the trained model using model._enc_cluster_centroids, right?

That variable should hold a matrix of k (number of centroids) by n (number of categorical variables). It seems that for you it's 1-dimensional, which looks wrong to me.
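
To illustrate the shape problem described above, here is a minimal sketch in plain NumPy. It does not call kmodes itself: `decode_rows` is a hypothetical helper that only mirrors the loop visible in the traceback, and the toy `mapping` is invented for the example. It shows why `encoded.shape[1]` raises on a 1-dimensional array, and how a k-by-n array decodes once the shape is right. Note also that the array printed in the question has 24 values, 17 of them non-integer, while the mapping has 7 entries, so it may not be the encoded categorical part at all.

```python
import numpy as np


def decode_rows(encoded, mapping):
    """Hypothetical helper (not the kmodes API): decode integer codes
    column by column, checking the shape before indexing it."""
    encoded = np.asarray(encoded)
    if encoded.ndim != 2:
        raise ValueError(f"expected a 2-D (k, n) array, got shape {encoded.shape}")
    if encoded.shape[1] != len(mapping):
        raise ValueError(
            f"{encoded.shape[1]} columns but {len(mapping)} mapping entries"
        )
    columns = []
    for ii in range(encoded.shape[1]):
        # Invert {original value: code} into {code: original value}.
        inv_mapping = {code: value for value, code in mapping[ii].items()}
        columns.append([inv_mapping[int(code)] for code in encoded[:, ii]])
    # Transpose the per-column lists back into one list per centroid.
    return [list(row) for row in zip(*columns)]


# Toy mapping for two categorical variables (assumed, for illustration only).
mapping = [{"low": 0, "high": 1}, {"red": 0, "green": 1, "blue": 2}]

# A 1-D array has a one-element shape tuple, so shape[1] does not exist.
one_row = np.array([1, 2])
try:
    one_row.shape[1]
except IndexError as exc:
    error_message = str(exc)

# A proper (k, n) matrix decodes; a single centroid needs promoting to 2-D first.
decoded_all = decode_rows(np.array([[1, 0], [0, 2]]), mapping)
decoded_one = decode_rows(np.atleast_2d(one_row), mapping)
```

If the array really is a single centroid of categorical codes, `np.atleast_2d` (or `reshape(1, -1)`) is enough to get past the IndexError. If its length does not match the number of mapping entries, as the column check here would report, the more likely cause is that the wrong array was passed in.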