nicodv / kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
MIT License

Add error & short circuit if Pandas dataframe is passed into fit_predict instead of NumPy array #114

Closed regorsmitz closed 5 years ago

regorsmitz commented 5 years ago

I'm running the following code:

prototypeClustering = KPrototypes(n_clusters=10, 
                                  init='Cao', 
                                  verbose=100)
prototypeClustering.fit_predict(X, categorical=[i for i in range(categorical_df.columns.values.size)])

Since I have verbose mode on, I can see the moves per iteration, and I have noticed that with the configuration above, the training fails as soon as there is an iteration where there are 0 moves. To reproduce this issue, I recommend using a small dataset with a high number of clusters, so there is a high probability of an iteration with 0 moves.

Output / stack trace:

Initialization method and algorithm are deterministic. Setting n_init to 1.
Init: initializing centroids
Init: initializing clusters
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 1, iteration: 1/100, moves: 2640, ncost: 906024701442.8253
Run: 1, iteration: 2/100, moves: 935, ncost: 863644943798.5979
Run: 1, iteration: 3/100, moves: 557, ncost: 844366144404.3018
Run: 1, iteration: 4/100, moves: 398, ncost: 829619773050.4286
Run: 1, iteration: 5/100, moves: 325, ncost: 818463604224.1627
Run: 1, iteration: 6/100, moves: 256, ncost: 813235837011.3778
Run: 1, iteration: 7/100, moves: 165, ncost: 811553263961.7179
Run: 1, iteration: 8/100, moves: 130, ncost: 810452778360.2623
Run: 1, iteration: 9/100, moves: 126, ncost: 809493708178.163
Run: 1, iteration: 10/100, moves: 81, ncost: 808941359440.6614
Run: 1, iteration: 11/100, moves: 62, ncost: 808673546931.4755
Run: 1, iteration: 12/100, moves: 45, ncost: 808447845407.1216
Run: 1, iteration: 13/100, moves: 38, ncost: 808307752250.539
Run: 1, iteration: 14/100, moves: 24, ncost: 808243120277.072
Run: 1, iteration: 15/100, moves: 22, ncost: 808210883455.4402
Run: 1, iteration: 16/100, moves: 9, ncost: 808201300381.1038
Run: 1, iteration: 17/100, moves: 11, ncost: 808189508679.0436
Run: 1, iteration: 18/100, moves: 12, ncost: 808171886835.0874
Run: 1, iteration: 19/100, moves: 25, ncost: 808121825481.3004
Run: 1, iteration: 20/100, moves: 38, ncost: 808020403165.9956
Run: 1, iteration: 21/100, moves: 35, ncost: 807951740463.9619
Run: 1, iteration: 22/100, moves: 25, ncost: 807914200232.4612
Run: 1, iteration: 23/100, moves: 19, ncost: 807840929538.2213
Run: 1, iteration: 24/100, moves: 16, ncost: 807795774926.7335
Run: 1, iteration: 25/100, moves: 29, ncost: 807755854677.4387
Run: 1, iteration: 26/100, moves: 7, ncost: 807752426872.5327
Run: 1, iteration: 27/100, moves: 1, ncost: 807752358570.4669
Run: 1, iteration: 28/100, moves: 0, ncost: 807752358570.4669

TypeError                                 Traceback (most recent call last)
in
      2                                   init='Cao',
      3                                   verbose=100)
----> 4 prototypeClustering.fit_predict(X, categorical=[i for i in range(categorical_df.columns.values.size)])
      5 #prototypeClustering.fit_predict(X, categorical=[i for i in range(10)])

~/data_analytics/lib/python3.6/site-packages/kmodes/kmodes.py in fit_predict(self, X, y, **kwargs)
    374         predict(X).
    375         """
--> 376         return self.fit(X, **kwargs).predict(X, **kwargs)
    377
    378     def predict(self, X, **kwargs):

~/data_analytics/lib/python3.6/site-packages/kmodes/kprototypes.py in predict(self, X, categorical)
    436         assert hasattr(self, '_enc_cluster_centroids'), "Model not yet fitted."
    437
--> 438         Xnum, Xcat = _split_num_cat(X, categorical)
    439         Xnum, Xcat = check_array(Xnum), check_array(Xcat, dtype=None)
    440         Xcat, _ = encode_features(Xcat, enc_map=self._enc_map)

~/data_analytics/lib/python3.6/site-packages/kmodes/kprototypes.py in _split_num_cat(X, categorical)
     42     :param categorical: Indices of categorical columns
     43     """
---> 44     Xnum = np.asanyarray(X[:, [ii for ii in range(X.shape[1])
     45                                if ii not in categorical]]).astype(np.float64)
     46     Xcat = np.asanyarray(X[:, categorical])

~/data_analytics/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2925             if self.columns.nlevels > 1:
   2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
   2928             if is_integer(indexer):
   2929                 indexer = [indexer]

~/data_analytics/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2655                                  'backfill or nearest lookups')
   2656             try:
-> 2657                 return self._engine.get_loc(key)
   2658             except KeyError:
   2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(slice(None, None, None), [11, 12, 13, 14, 15, 16, 17])' is an invalid key

regorsmitz commented 5 years ago

More fiddling revealed that this issue occurs when a pandas DataFrame, rather than a NumPy array, is passed to fit_predict. The documentation says to pass a NumPy array, so this is my mistake, but I imagine other people will try passing a DataFrame too, since they are used to other prediction functions handling DataFrames properly. As an improvement, it would be nice to check whether a DataFrame has been passed in and short circuit with a clear error, rather than running all iterations and then appearing to be broken internally.
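
To illustrate the kind of check I mean, here is a minimal sketch. It is not kmodes code: `reject_dataframe` and `as_ndarray` are hypothetical helper names, and the DataFrame detection is duck-typed on the `iloc`, `columns` and `to_numpy` attributes so that the sketch only needs NumPy.

```python
import numpy as np


def reject_dataframe(X):
    """Fail fast if X looks like a pandas DataFrame (hypothetical helper)."""
    # Duck-typed check so this sketch does not need to import pandas:
    # DataFrames expose both `iloc` and `columns`; NumPy arrays expose neither.
    if hasattr(X, "iloc") and hasattr(X, "columns"):
        raise TypeError(
            "Expected a NumPy array but got a DataFrame-like object; "
            "convert it first, e.g. with X.to_numpy()."
        )
    return X


def as_ndarray(X):
    """Convert DataFrame-like input to a NumPy array (hypothetical helper)."""
    # Assumption: anything with a `to_numpy` method can convert itself.
    if hasattr(X, "to_numpy"):
        return X.to_numpy()
    return np.asarray(X)
```

In the meantime, the workaround on the caller's side would be to convert before calling, along the lines of `prototypeClustering.fit_predict(as_ndarray(X), categorical=...)`. On pandas versions without `to_numpy`, `X.values` should serve the same purpose.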

nicodv commented 5 years ago

This one feels similar to: https://github.com/nicodv/kmodes/issues/67

Have you tried this with the latest GitHub version?

nicodv commented 5 years ago

This should be fixed now on master after merging https://github.com/nicodv/kmodes/pull/117, courtesy of @Genie-Liu.

I'm considering making a 0.10.1 patch release for this, as it seems a common problem.