nicodv / kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
MIT License

Centroids are mangled #157

Closed by matteosantama 2 years ago

matteosantama commented 3 years ago

Expected Behavior

If I ask for N centroids on N data points, I should get those same data points back as the centroids.

Actual Behavior

The columns and rows of the centroid matrix are permuted.

Steps to Reproduce the Problem

import numpy as np
import pandas as pd
from kmodes.kprototypes import KPrototypes

def _df_to_numpy(df: pd.DataFrame) -> np.ndarray:
    """Convert to numpy matrix and preserve orientation."""

    def sanitize(s: pd.Series) -> np.ndarray:
        if pd.api.types.is_bool_dtype(s):
            s = s.astype("float32")
        elif pd.api.types.is_categorical_dtype(s):
            s = s.cat.codes.astype("float32")
        elif pd.api.types.is_integer_dtype(s):
            s = s.astype("int32")
        return s.to_numpy()

    return np.column_stack([sanitize(df[x]) for x in df.columns])

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [True, False, True, False, True],
    "C": pd.Categorical(["x", "y", "y", "x", "x"]),
    "D": [6, 7, 8, 9, 10],
    "E": pd.Categorical(["s", "t", "s", "s", "t"])
})

matrix = _df_to_numpy(df)
model = KPrototypes(n_clusters=5)
model.fit(matrix, categorical=[2, 4])

print("centroids:\n", model.cluster_centroids_)
print("data_points:\n", matrix)

centroids:
 [[ 4.  0.  0.  9.  0.]
 [ 1.  1.  0.  6.  0.]
 [ 2.  0.  1.  7.  1.]
 [ 3.  1.  1.  8.  0.]
 [ 5.  1.  0. 10.  1.]]
data_points:
 [[ 1.  1.  6.  0.  0.]
 [ 2.  0.  7.  1.  1.]
 [ 3.  1.  8.  1.  0.]
 [ 4.  0.  9.  0.  0.]
 [ 5.  1. 10.  0.  1.]]

matteosantama commented 3 years ago

This leads to strange behavior: if you predict on one of the centroids, you do not get that centroid's label back, for example:

centroid = np.atleast_2d(model.cluster_centroids_[0])
model.predict(centroid, categorical=[2, 4])
# returns array([4], dtype=uint16); I would expect the 0th centroid to be labeled 0 here

nicodv commented 3 years ago

I tried to capture this in a test, but the test passes consistently:

    def test_kprototypes_nclusters_equals_ndata(self):
        data = np.array([
            [1, 1, 'x', 6, 's'],
            [2, 0, 'y', 7, 't'],
            [3, 1, 'y', 8, 's'],
            [4, 0, 'x', 9, 's'],
            [5, 1, 'x', 10, 't'],
        ])
        kproto = kprototypes.KPrototypes(n_clusters=5, init='Cao',
                                         verbose=2, random_state=42)
        kproto.fit(data, categorical=[2, 4])
        centroids = kproto.cluster_centroids_.copy()
        np.testing.assert_array_equal(
            centroids[centroids[:, 0].argsort()],
            np.array([
                [1., 1., 6., 'x', 's'],
                [2., 0., 7., 'y', 't'],
                [3., 1., 8., 'y', 's'],
                [4., 0., 9., 'x', 's'],
                [5., 1., 10., 'x', 't'],
            ])
        )

Can you see how our situations differ, @matteosantama ?

nicodv commented 3 years ago

Oh, I see the problem: cluster_centroids_ is a concatenation of first the numerical parts and then the categorical parts of the centroids. It does not guarantee the original order of the features, which is what you're assuming.

You can get the original order back using the indices you passed as categorical=[2, 4]: every column not listed there is numerical and comes first in cluster_centroids_, followed by the categorical columns.
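A minimal sketch of that reordering, using only NumPy. The helper name `restore_feature_order` is hypothetical and not part of kmodes. It assumes the layout described above (numerical columns first, then categorical columns) and additionally assumes the categorical columns appear in the same order as the `categorical` list.

```python
import numpy as np


def restore_feature_order(centroids, categorical):
    """Reorder columns from [numerical..., categorical...] to the original order.

    Hypothetical helper: assumes numerical columns are stored first, followed
    by the categorical columns in the order given by `categorical`.
    """
    centroids = np.asarray(centroids)
    n_features = centroids.shape[1]
    categorical = list(categorical)
    numerical = [i for i in range(n_features) if i not in categorical]
    # Column j of `centroids` holds original feature stored_order[j].
    stored_order = numerical + categorical
    # argsort yields, for each original feature index, its stored position.
    return centroids[:, np.argsort(stored_order)]


# Two rows in the stored layout (A, B, D, C, E), taken from the example data.
stored = np.array([[1., 1., 6., 0., 0.],
                   [2., 0., 7., 1., 1.]])
restored = restore_feature_order(stored, categorical=[2, 4])
print(restored)  # columns are back in the order A, B, C, D, E
```

This only fixes the column order. The row order of the centroids follows the cluster labels, which are arbitrary, so rows still need to be matched to data points separately (for example by sorting on a column, as the test above does).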