milesgranger / gap_statistic

Dynamically get the suggested clusters in the data for unsupervised learning.
The Unlicense
217 stars 46 forks source link

OptimalK _calculate_gap method passes incorrect data to clusterer? #33

Closed johnvorsten closed 5 years ago

johnvorsten commented 5 years ago

In line 181-183 of optimalK.py, the _calculate_gap() method finds the dispersion measure of the observed data-set. centroids, labels = self.clusterer(random_data, n_clusters, *self.clusterer_kwargs ) # type: Tuple[np.ndarray, np.ndarray]

However, it looks like this line passes the null data-set instead of the observed data-set. The centroids and labels of this call are then passed to the _calculate_dispersion() method using the observed data-set. This means the dispersion metric is using the incorrect centroids and labels, which gives a larger dispersion metric for the observed set.

I changed this line to pass the observed data instead of the null data and this is the result compared to an implementation I did (labeled Gap2) :).

Reference_Gap

milesgranger commented 5 years ago

Wow, that does seem wrong. I checked the Rust side of things and it does it correctly. On vacation right now, but if you submit a PR I'll merge it in!

shaypal5 commented 5 years ago

Omg, this might explain the weird behavior I've been seeing on my data. I've been trying to find some bug in the code for a couple of days now! Great catch, @johnvorsten !

shaypal5 commented 5 years ago

OMG THAT WAS IT!!!! THANKS!!