About the correction step to gamma-imbalanced partition

tginart / deletion-efficient-kmeans

Research prototype of deletion efficient k-means algorithms

BSD 2-Clause "Simplified" License

23 stars 4 forks source link

About the correction step to gamma-imbalanced partition #2

Open thupchnsky opened 2 years ago

thupchnsky commented 2 years ago

Hi @tginart,

Thanks for making your code public. I'm very interested in your work and would like to understand the whole procedure better. However, I have a question about the gamma-imbalanced correction step. How can we guarantee the cluster size is at least self.momentum after applying self._momentum_correction? I could not find proof of this in the paper and it is hard for me to imagine how we can guarantee the cluster size by averaging two centroids by weights. Moreover, the momentum correction step is sequential and the later correction may have influenced previous corrections, which makes the theoretical analysis even harder.

Please let me know if I misunderstand something here. And thanks for your response in advance!

Best, Chao

tginart commented 2 years ago

Thank you for the interest and the question.

The proof that uses the gamma correction is Lemma C.3. It's not that the cluster size is guaranteed, but it's that in any step, the effect of any individual point on the centroid is "capped" via the gamma correction.

Does that make sense? Happy to answer any follow up questions.

thupchnsky commented 2 years ago

Thanks for your quick response! The proof of Lemma C.3. is indeed the place that confuses me a little bit.

Since this is just computing the centroid for one cluster, I guess the first $\frac{1}{n}$ should be $\frac{1}{|\pi\kappa|}$ and the second $\frac{1}{n}$ should be $\frac{1}{|\pi\kappa|-|\pi_\kappa\cap\Delta|}$, where $n$ is the size of the whole dataset? That is why I imagine you would need some bound on the cluster size at each round. Please correct me if I have any misunderstanding here.

I am not sure what "capped" means exactly here. Would appreciate it if you could explain it in a bit more detail or with some expressions. And thanks again for the quick response!

tginart commented 2 years ago

Thank you for bringing this up. After looking back at my notes, I do think that the denominators you've circled are mistakes and should be as you've mentioned.

Actually though, the next few lines in the proof work just fine as if your denominators had been there all along. The L2 bound between any two data points still holds, and therefore the absolute difference is still upper bounded by $\sqrt{d} k /\gamma n$. Is this at least clear in the first case, where $|\pi_\kappa| \geq \gamma n / k$ ?

In the second case, the $\gamma$-correction just averages the new centroid with the previous one in order to decrease the effect of any point on the new centroid.

Please let me know if you have further questions.

thupchnsky commented 2 years ago

Thanks for your response! My question is exactly about the second case. Although the $\gamma$-correction step is intuitive and absolutely makes sense, how can we always guarantee $\pi\kappa$ would become $|\pi\kappa|\geq \gamma n/k$ after the correction? Or there is another way to show in theory that the absolute difference between the centroids before and after removal is upper bounded by $mk\sqrt{d}/\gamma n$, even when $|\pi_\kappa| < \gamma n/k$?

One minor comment: the larger coefficient between $\frac{1}{|\pi\kappa|}$ and $\frac{1}{|\pi\kappa|-|\pi_\kappa\cap \Delta|}$ should be the latter one and it is actually upper bounded by $1/(\gamma n/k - m)$.