nicodv / kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
MIT License

Cannot handle NaN input #162

Closed: PascalEconomouCoupa closed this issue 2 years ago

PascalEconomouCoupa commented 2 years ago

Expected Behavior

I expected the kmodes algorithm to be able to handle missing values (np.nan), as described in the README.

Actual Behavior

I get an error when the input matrix X has a missing value.

Steps to Reproduce the Problem

import numpy as np
from kmodes.kmodes import KModes
km = KModes(n_clusters=2, init='Huang', n_init=1, verbose=1)
X = np.array([[np.nan, 1], [0, 1], [0, 0]])
km.fit_predict(X)

Output:

~/.local/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').


nicodv commented 2 years ago

Looks like sklearn has updated its data validation. Since kmodes tries to follow the sklearn interface and behavior where possible, I'm hereby essentially dropping support for NaNs.

I will update the documentation accordingly.
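
With NaN support dropped, one possible workaround is to encode missing values as an explicit category before fitting. The snippet below is a minimal sketch of that idea, not something the library provides: `encode_missing` and the `MISSING` sentinel are hypothetical names introduced here, and the approach assumes it is acceptable for missing entries to count as matching one another during clustering, which may not suit every dataset.

```python
import numpy as np

# Hypothetical sentinel for "missing"; it must not collide with a real category code.
MISSING = -1


def encode_missing(X, sentinel=MISSING):
    """Return a copy of X with NaN entries replaced by a sentinel category."""
    X = np.asarray(X, dtype=float)
    # NaN never compares equal to anything, so this only detects real collisions.
    if np.any(X == sentinel):
        raise ValueError("sentinel collides with an existing category value")
    return np.where(np.isnan(X), sentinel, X)


# Same matrix as in the report above.
X = np.array([[np.nan, 1], [0, 1], [0, 0]])
X_clean = encode_missing(X)
```

`X_clean` contains no NaN, so it should get past the finiteness check that raised the error above. It could then be passed to the call from the report, for example `km.fit_predict(X_clean)`.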