nicodv / kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
MIT License

K-prototypes return odd results #161

Closed AntanasM5 closed 2 years ago

AntanasM5 commented 2 years ago

Expected Behavior

As an example I used this mixed dataset:

# Creating a dictionary with the fake data
dictionary = {
    "age": [22, 25, 30, 38, 42, 47, 55, 62, 61, 90],
    "gender": ["M", "M", "F", "F", "F", "M", "M", "M", "M", "M"],
    "civil_status": ["SINGLE", "SINGLE", "SINGLE", "MARRIED", "MARRIED",
                     "SINGLE", "MARRIED", "DIVORCED", "MARRIED", "DIVORCED"],
    "salary": [18000, 23000, 27000, 32000, 34000, 20000, 40000, 42000, 25000, 70000],
    "has_children": [False, False, False, True, True, False, False, False, False, True],
    "purchaser_type": ["LOW_PURCHASER", "LOW_PURCHASER", "LOW_PURCHASER",
                       "HEAVY_PURCHASER", "HEAVY_PURCHASER", "LOW_PURCHASER",
                       "MEDIUM_PURCHASER", "MEDIUM_PURCHASER", "MEDIUM_PURCHASER",
                       "LOW_PURCHASER"],
}
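For reference, a minimal sketch (plain numpy; the dictionary is repeated so the snippet is self-contained) of how this data can be assembled into the 2-D array that `KPrototypes.fit_predict` expects, with the categorical column indices spelled out:

```python
import numpy as np

dictionary = {
    "age": [22, 25, 30, 38, 42, 47, 55, 62, 61, 90],
    "gender": ["M", "M", "F", "F", "F", "M", "M", "M", "M", "M"],
    "civil_status": ["SINGLE", "SINGLE", "SINGLE", "MARRIED", "MARRIED",
                     "SINGLE", "MARRIED", "DIVORCED", "MARRIED", "DIVORCED"],
    "salary": [18000, 23000, 27000, 32000, 34000, 20000, 40000, 42000, 25000, 70000],
    "has_children": [False, False, False, True, True, False, False, False, False, True],
    "purchaser_type": ["LOW_PURCHASER", "LOW_PURCHASER", "LOW_PURCHASER",
                       "HEAVY_PURCHASER", "HEAVY_PURCHASER", "LOW_PURCHASER",
                       "MEDIUM_PURCHASER", "MEDIUM_PURCHASER", "MEDIUM_PURCHASER",
                       "LOW_PURCHASER"],
}

# Rows become samples, dict keys become columns; dtype=object keeps the
# numbers numeric instead of forcing everything to strings.
X = np.array(list(zip(*dictionary.values())), dtype=object)

# Columns 0 ("age") and 3 ("salary") are numerical; the rest are
# categorical and would be passed as categorical=[1, 2, 4, 5].
categorical_idx = [1, 2, 4, 5]
```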

I tried the K-Means algorithm with only the numerical features, and K-Prototypes with the numerical + categorical features. I expected the results to be different.

Actual Behavior

In both cases I got the same results. I also tried a bigger dataset of my own, with the same outcome: k-prototypes on categorical + numerical features returns the same cluster labels as K-Means on the numerical features alone.

Steps to Reproduce the Problem

1. Run KMeans with vars_num = ['age', 'salary'], then KPrototypes with vars_num = ['age', 'salary'] and vars_cat = ['gender', 'civil_status', 'has_children', 'purchaser_type'] (i.e. vars_num + vars_cat), and you will get the same output.
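For what it's worth, k-prototypes should generally not reproduce the K-Means labels, because its dissimilarity adds a categorical term: in Huang's k-prototypes formulation, the distance from a point to a prototype is the squared Euclidean distance on the numerical columns plus a weight gamma times the number of categorical mismatches. A toy illustration of how the categorical term can flip which prototype is nearest (plain Python, not the library's internals; the points and gamma values are made up):

```python
def proto_dist(x_num, x_cat, p_num, p_cat, gamma):
    # squared Euclidean on the numerical part + gamma * categorical mismatches
    num = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    cat = sum(a != b for a, b in zip(x_cat, p_cat))
    return num + gamma * cat

x = ([30], ["A"])
p1 = ([31], ["B"])   # numerically close, categorical mismatch
p2 = ([40], ["A"])   # numerically far, categorical match

# With a small gamma the numerical part dominates and p1 wins;
# with a large gamma the categorical part dominates and p2 wins.
nearest_low = min((p1, p2), key=lambda p: proto_dist(*x, *p, gamma=0.5))
nearest_high = min((p1, p2), key=lambda p: proto_dist(*x, *p, gamma=200.0))
```

So if the categorical columns are actually reaching KPrototypes, identical labels to K-Means would be a coincidence, not the rule.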

Specifications

nicodv commented 2 years ago

I tried to reproduce it with the following dummy.csv:

22,M,S,18,False,L
25,M,S,23,False,L
30,F,S,27,False,L
38,F,M,32,True,H
42,F,M,34,True,H
47,M,S,20,False,L
55,M,M,40,False,M
62,M,D,42,False,M
61,M,M,25,False,M
90,M,D,70,True,L
import numpy as np
from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes

X = np.genfromtxt('dummy.csv', dtype=str, delimiter=',')
X[:, 0] = X[:, 0].astype(float)  # age
X[:, 3] = X[:, 3].astype(float)  # salary

kmodes = KModes(n_clusters=4, init='Cao', verbose=2)
clusters_kmodes = kmodes.fit_predict(X[:, [0, 3]])

print(clusters_kmodes)
print(kmodes.cluster_centroids_)
print(kmodes.cost_)
print(kmodes.n_iter_)

kproto = KPrototypes(n_clusters=4, init='Cao', verbose=2)
clusters_kproto = kproto.fit_predict(X, categorical=[1, 2, 4, 5])

print(clusters_kproto)
print(kproto.cluster_centroids_)
print(kproto.cost_)
print(kproto.n_iter_)

This gives me:

[0 1 2 3 0 0 0 0 0 0]
[['22.0' '18.0']
 ['25.0' '23.0']
 ['30.0' '27.0']
 ['38.0' '32.0']]
12.0
1

[1 1 1 3 3 3 2 2 2 0]
[['90.0' '70.0' 'M' 'D' 'True' 'L']
 ['25.666666666666668' '22.666666666666668' 'M' 'S' 'False' 'L']
 ['59.333333333333336' '35.666666666666664' 'M' 'M' 'False' 'M']
 ['42.333333333333336' '28.666666666666668' 'F' 'M' 'True' 'H']]
485.8296292303648
2

These are quite different results.
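(Note that cluster labels are arbitrary, so two labelings can encode the same partition under different numbering. A small helper, not part of kmodes, to check whether two labelings really describe different clusterings, using the two label arrays printed above:)

```python
def same_partition(labels_a, labels_b):
    """True iff two label sequences describe the same clustering,
    ignoring the (arbitrary) numbering of the clusters."""
    pairs = set(zip(labels_a, labels_b))
    # The partitions coincide exactly when the label correspondence
    # is one-to-one in both directions.
    return len(pairs) == len(set(labels_a)) == len(set(labels_b))

clusters_kmodes = [0, 1, 2, 3, 0, 0, 0, 0, 0, 0]
clusters_kproto = [1, 1, 1, 3, 3, 3, 2, 2, 2, 0]
print(same_partition(clusters_kmodes, clusters_kproto))  # False: genuinely different clusterings
```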

You can use the above example to see what you're doing wrong (sorry, I have to assume that).