Closed johnvorsten closed 5 years ago
Wow, that does seem wrong. I checked the Rust side of things and it does it correctly. On vacation right now, but if you submit a PR I'll merge it in!
Omg, this might explain the weird behavior I've been seeing on my data. I've been trying to find some bug in the code for a couple of days now! Great catch, @johnvorsten !
OMG THAT WAS IT!!!! THANKS!!
In line 181-183 of optimalK.py, the
_calculate_gap()
method finds the dispersion measure of the observed data-set.centroids, labels = self.clusterer(random_data, n_clusters, *self.clusterer_kwargs ) # type: Tuple[np.ndarray, np.ndarray]
However, it looks like this line passes the null data-set instead of the observed data-set. The centroids and labels of this call are then passed to the
_calculate_dispersion()
method using the observed data-set. This means the dispersion metric is using the incorrect centroids and labels, which gives a larger dispersion metric for the observed set.I changed this line to pass the observed data instead of the null data and this is the result compared to an implementation I did (labeled Gap2) :).