Closed matteosantama closed 2 years ago
This leads to strange behavior: if you predict on one of the centroids, you do not get that centroid's label back, e.g.:
centroid = np.atleast_2d(model.cluster_centroids_[0])
model.predict(centroid, categorical=[2, 4])
>>> array([4], dtype=uint16) # I would expect the 0'th centroid to be labeled 0 here
I tried to capture this in a test, but the test passes consistently:
def test_kprototypes_nclusters_equals_ndata(self):
    data = np.array([
        [1, 1, 'x', 6, 's'],
        [2, 0, 'y', 7, 't'],
        [3, 1, 'y', 8, 's'],
        [4, 0, 'x', 9, 's'],
        [5, 1, 'x', 10, 't'],
    ])
    kproto = kprototypes.KPrototypes(n_clusters=5, init='Cao',
                                     verbose=2, random_state=42)
    kproto.fit(data, categorical=[2, 4])
    centroids = kproto.cluster_centroids_.copy()
    np.testing.assert_array_equal(
        centroids[centroids[:, 0].argsort()],
        np.array([
            [1., 1., 6., 'x', 's'],
            [2., 0., 7., 'y', 't'],
            [3., 1., 8., 'y', 's'],
            [4., 0., 9., 'x', 's'],
            [5., 1., 10., 'x', 't'],
        ])
    )
Can you see how our situations differ, @matteosantama?
Oh, I see the problem: cluster_centroids_ is a concatenation of first the numerical parts and then the categorical parts of the centroids. It does not guarantee the original order of the features, which is what you're assuming.
You could get the original order back if you use the fact that you've got the info you need in categorical=[2, 4].
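A minimal sketch of that reordering, in case it helps. The helper name restore_feature_order is hypothetical and not part of kmodes. It assumes the stored layout is exactly "numerical columns in their original relative order, followed by the categorical columns in the order given in categorical", as described above. The sample rows are copied from the expected centroids in the test, written as strings.

```python
import numpy as np


def restore_feature_order(centroids, categorical):
    """Hypothetical helper: map columns stored as [numerical..., categorical...]
    back to the column order of the original input data."""
    centroids = np.asarray(centroids)
    n_features = centroids.shape[1]
    # Original column indices of the numerical features, in input order.
    numerical = [i for i in range(n_features) if i not in categorical]
    # stored_order[k] is the original column index held in stored column k
    # (assumption: numerical block first, then the categorical block).
    stored_order = numerical + list(categorical)
    # argsort inverts that permutation: entry j names the stored column
    # that holds original column j.
    return centroids[:, np.argsort(stored_order)]


# Two centroid rows in the stored layout (numerical first, then categorical).
stored = np.array([
    ['1.', '1.', '6.', 'x', 's'],
    ['2.', '0.', '7.', 'y', 't'],
])
restored = restore_feature_order(stored, categorical=[2, 4])
print(restored)
```

With the columns back in input order, a centroid row should line up with the categorical=[2, 4] layout that predict is given. This only undoes the column permutation; the row order of the centroids is still whatever the fit produced.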
Expected Behavior
If I ask to produce N centroids on N data points, I should have those same data points returned to me.
Actual Behavior
The columns and rows of the centroid matrix are permuted.
Steps to Reproduce the Problem
Specifications
0.11.0