nicodv / kmodes

Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data
MIT License

Estimation of Gamma in K-Prototypes #185

Closed. crixus5678 closed this issue 1 year ago

crixus5678 commented 1 year ago

For the estimation of gamma in k-prototypes, the current implementation appears to compute gamma as 0.5 * (standard deviation of all numeric data), i.e. a single standard deviation taken over all numeric values pooled together.

However, the paper [Huang 1997] says that gamma is guided by the "average standard deviation of numeric attributes". If that is the case, shouldn't we compute the standard deviation of each numeric attribute separately and then take the mean of those values?

nicodv commented 1 year ago

For reference, from the paper:

Generally speaking, γ_l is related to σ_l, the average standard deviation of numeric attributes in cluster l. In practice, σ_l can be used as a guidance to determine γ_l. However, since σ_l is unknown before clustering, the overall average standard deviation σ of numeric attributes can be used for all σ_l.

So yes, it appears you are correct in your statement.
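
To make the difference concrete, here is a minimal NumPy sketch of the two estimates discussed above. It is an illustration only, not the library's code: the function names `gamma_pooled` and `gamma_mean_of_stds` are hypothetical, the 0.5 factor is taken from the description in this issue, the data is made up, and the population standard deviation (NumPy's default, `ddof=0`) is assumed.

```python
import numpy as np


def gamma_pooled(x_num, factor=0.5):
    """factor * std of all numeric values flattened together (hypothetical helper)."""
    return factor * np.asarray(x_num, dtype=float).std()


def gamma_mean_of_stds(x_num, factor=0.5):
    """factor * mean of the per-attribute (per-column) stds (hypothetical helper)."""
    return factor * np.asarray(x_num, dtype=float).std(axis=0).mean()


# Made-up data: two numeric attributes on very different scales.
x_num = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
])

print("pooled std estimate:      ", gamma_pooled(x_num))
print("mean of per-attribute std:", gamma_mean_of_stds(x_num))
```

On this data the pooled estimate comes out larger, because flattening the array also counts the gap between the two attributes' means as spread. The two estimates coincide only in special cases, for example when there is a single numeric attribute.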