intellistream / Sesame

[SIGMOD 2023] Data Stream Clustering: An In-depth Empirical Study
MIT License

fix bugs about G1 #164

Closed tuidan closed 1 year ago

tuidan commented 1 year ago

Three problems:

  1. We run offline clustering (such as KMeans) only on the real centers and assign each one a cluster id (starting from 0) before sinking. However, when we sink the outlier centers, we forget to set the id and only set the outlier flag (bool) to true.
  2. For KMeans, an error occurs when k < 2.
  3. Evaluation requires the cluster ids we set to start from 0, but we set the id of outlier centers to -1 in the groupByCentersWithOffline function and forget to set the id for outlier centers in the groupByCenters function.

So I now include both real and outlier centers when running offline clustering (such as KMeans), which means it is no longer necessary to set ids for outlier centers separately. All ids start from 0. Besides, after this change, the old groupByCenters can be removed, and we can use the same function to group the result centers with or without offline clustering.
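
To illustrate the idea, here is a minimal sketch and not the actual Sesame code. The `Center` struct, the `assignIds` helper, and the signature of `groupByCenters` are assumptions made up for this example; only the name `groupByCenters` comes from the description above. It shows real and outlier centers receiving ids through the same path, so a single grouping function covers both the offline and the non-offline case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a result center; not Sesame's real type.
struct Center {
  std::vector<double> position;
  bool is_outlier = false;
  int cluster_id = -1;  // -1 means "not assigned yet"
};

// Assign a cluster id to every center, real or outlier alike.
// With offline clustering, labels[i] is the label (0-based) that the
// offline algorithm produced for centers[i]. Without offline clustering,
// pass an empty vector and each center becomes its own cluster.
void assignIds(std::vector<Center>& centers, const std::vector<int>& labels) {
  const bool has_offline = !labels.empty();
  assert(!has_offline || labels.size() == centers.size());
  for (std::size_t i = 0; i < centers.size(); ++i) {
    centers[i].cluster_id = has_offline ? labels[i] : static_cast<int>(i);
  }
}

// Single grouping function: groups[id] holds the indices of all centers
// whose cluster_id equals id. Outliers need no special case because
// their ids were assigned together with the real centers.
std::vector<std::vector<std::size_t>> groupByCenters(
    const std::vector<Center>& centers) {
  std::vector<std::vector<std::size_t>> groups;
  for (std::size_t i = 0; i < centers.size(); ++i) {
    const int id = centers[i].cluster_id;
    assert(id >= 0);  // evaluation expects ids that start from 0
    if (static_cast<std::size_t>(id) >= groups.size()) {
      groups.resize(static_cast<std::size_t>(id) + 1);
    }
    groups[static_cast<std::size_t>(id)].push_back(i);
  }
  return groups;
}
```

In this sketch, calling `assignIds(centers, labels)` followed by `groupByCenters(centers)` handles the offline case, and calling `assignIds(centers, {})` first handles the case without offline clustering.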