nicodv / kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
MIT License

What are the minimum characteristics that a binary matrix must meet to avoid the following error: "Insufficient Number of data since union is 0"? #178

Closed Galy88 closed 2 years ago

Galy88 commented 2 years ago

Expected Behavior

Given a binary matrix (for example, of size 3x4), the KModes algorithm is expected to run and assign each row of the matrix to a cluster.

Actual Behavior

We have a 3x4 binary matrix. When we run the KModes algorithm on it, we immediately get the following error: "Insufficient Number of data since union is 0". With other 3x4 binary matrices, the algorithm sometimes works.

Steps to Reproduce the Problem

  1. Import libraries:

         import numpy as np
         from kmodes.kmodes import KModes
         from kmodes.util.dissim import jaccard_dissim_binary

  2. Create the binary matrix:

         m = np.array([[0, 1, 1, 1],
                       [0, 1, 1, 0],
                       [1, 1, 1, 0]])

  3. Set up KModes:

         km = KModes(n_clusters=2, init='cao', random_state=0, n_jobs=-1, cat_dissim=jaccard_dissim_binary)

  4. Fit and predict:

         clusters = km.fit_predict(m)

With this matrix, the algorithm works:

m2 = np.array([[1, 0, 0, 1], 
               [0, 1, 1, 0],
               [0, 1, 1, 0]])

clusters = km.fit_predict(m2)


nicodv commented 2 years ago

This appears to be caused by the jaccard_dissim_binary dissimilarity function in combination with the particulars of the data set. It appears that this function does not support binary data in which all rows have the same value in a column.
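As a rough illustration of the diagnosis above, here is a minimal plain-Python sketch. It is not the kmodes implementation, and `jaccard_dissim` and `constant_columns` are hypothetical helper names introduced only for this example. The first function shows why a Jaccard dissimilarity is undefined when the union of two binary vectors is empty, which is what the error message seems to refer to. The second checks a matrix for columns where every row has the same value, using the two matrices from the report.

```python
def jaccard_dissim(a, b):
    """Jaccard dissimilarity of two binary vectors: 1 - |a AND b| / |a OR b|."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    if union == 0:
        # Both vectors are all zeros, so the ratio would be 0 / 0.
        raise ValueError("union is 0: Jaccard dissimilarity is undefined")
    return 1 - intersection / union


def constant_columns(rows):
    """Return the indices of columns that hold the same value in every row."""
    return [j for j, col in enumerate(zip(*rows)) if len(set(col)) == 1]


# The matrix that triggers the error in the report.
m = [[0, 1, 1, 1],
     [0, 1, 1, 0],
     [1, 1, 1, 0]]

# The matrix that works in the report.
m2 = [[1, 0, 0, 1],
      [0, 1, 1, 0],
      [0, 1, 1, 0]]

print(constant_columns(m))   # [1, 2]
print(constant_columns(m2))  # []
```

The failing matrix has two constant columns and the working one has none, which is consistent with the explanation above. The sketch does not show how an empty union actually arises inside the fitting procedure.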

You'll either need to: