nglm / pycvi

Internal Cluster Validity Indices in python, compatible with time-series data
MIT License
4 stars 0 forks source link

Davies bouldin index returning inf due to self distance computation between cluster centers #9

Closed benjaminh closed 1 month ago

benjaminh commented 1 month ago

When trying to compute Davies Bouldin score on time series clustering results, I ended up with infinite value.

After some investigation, I found that distance between centroids returns 0 when computing distance between a centroid and itself (here)

Yet, when computing the final score, no distinction seems to be made on these values. So davies_bouldin() function returns inf because it looks for the max value between any distances and an infinite value.

And this infinite value is always included, no matter what because of the first condition above.

See : https://github.com/nglm/pycvi/blob/9a356a554e4ed23cc64ed5f4b49cb18e60d92005/pycvi/cvi_func.py#L1007

This line's purpose is unclear to me : why adding infinite value when distance is 0, is it possible to have two different clusters with a null distance between their centroids ? Otherwise, distance between a cluster center and itself should just be ignored, right ? Is it something specific to the Davies Bouldin score implementation ? Does anyone encountered the same problem ?

I may be wrong but what about changing this line to :

if i != j else 0

I feel like it would just ignore the cases where distance have been computed between a cluster center and itself ? (and so that it doesn't impact the _dist_between_centroids() function that may be used for other purposes/metrics)

What do you think ?

nglm commented 1 month ago

Hi @benjaminh ! Thank you again for sharing your comments!

I think you found something in the code that is actually not consistent with the original paper as the paper does specify that i should be different from j when computing the max value (which indeed makes senses because, as you pointed out, it would otherwise always pick this infinite value).

However, I am not sure your suggested solution would be "loyal" to the original paper [^DB], as two clusters could have the exact same centroid while being actually very well separated (imagine 2 concentric circles of different radius). And, in this case, Davies Bouldin index would actually completely fail to assess those clusters as well separated and will then give them a bad "score" (inf). As of now, I don't think it is PyCVI's task to "fix" the original index as in this case, it was intended that clusters with identical centroids would give a bad Davies-Boulding score instead of being ignored and pretend that there is nothing wrong in the given clustering.

I just updated the master branch with the following code instead:

# Compute R_ijs even when i=j
R_ijs = [
    [
        (S_is[i] + S_is[j]) / dist_between_centroids[i][j]
        if dist_between_centroids[i][j] != 0 else np.inf
        for j in range(k)
    ] for i in range(k)
]

# Remove the case i == j as described in the paper
R_ijs = [
    [
        R_ijs[i][j] for j in range(k) if i != j
    ] for i in range(k)
]

# Compute R_is = max_j of R_ijs when i != j
R_is = [
    np.amax(R_ijs) for i in range(k)
]

Doing it in two steps allows us to distinguish between cases where dist_between_centroids[i][j] == 0 because i == j (which we want to ignore as the original paper suggests) and cases where dist_between_centroids[i][j] == 0 with i != j but because the centroids of the clusters are actually the same (which we want to penalize in terms of Davies-Bouldin results).

What do you think of this solution? Does it still answer your concerns?

[^DB]: D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, pp. 224–227, Apr. 1979.

benjaminh commented 1 month ago

Hello @nglm ,

thank you for your quick feedback. You're right, I missed the case when Davies-Bouldin should indeed return np.inf if two different clusters share the same center (like in your illustrating example).

Your solution seems allright to me and does answer my concerns ! Thanks again :)