kno10 / rust-kmedoids

k-Medoids clustering in Rust with the FasterPAM algorithm
GNU General Public License v3.0
20 stars 3 forks source link

Add a BanditPAM, BanditPAM++ implementation #2

Closed jianshu93 closed 7 months ago

jianshu93 commented 2 years ago

Dear rust-k-mediods team,

This is amazing implementation and many thanks for this. I noticed an even faster one, which was based on fastPAM but 4 times faster. It is not an exact solution but approximate solution. However, it is as accuracy as the exact solution in practice. Check here: https://proceedings.neurips.cc/paper/2020/file/73b817090081cef1bca77232f4532c5d-Paper.pdf

Any thoughts and potential to also implement this? It would be even useful for very large dataset.

Thanks,

Jianshu

kno10 commented 2 years ago

Its not faster in our experiments. See https://github.com/ThrunGroup/BanditPAM/issues/175

kno10 commented 9 months ago

Because we do not have distance functions, but currently operate on precompupted distance matrixes only, this is currently out-of-scope for this package, but this may, of course, change in the future. Because the available implementation of BanditPAM performed worse for us (despite being already C++), it is not of a high priority.

We would appreciate the contribution of such algorithms as long as they do not add major external dependencies (to avoid dependency hell). I would suggest to first add:

These are quite similar but use uniform sampling, although my impression that FasterPAM will still outperform these if you want high accuracy. I do not think the BadingPAM authors ever compared to these baseline method...

kno10 commented 9 months ago

Note there is a newer version of BanditPAM to be published at NeurIPS 2023. The preprint is here: https://arxiv.org/pdf/2310.18844.pdf

But I am not keen of that paper either, because it

  1. does not compare to relevant baselines including CLARANS and FastCLARANS, or perform any reliable runtime comparison at all in the latest version (possibly because it did not look good to comparing to ours).
  2. contains misclaims, such as "these usually sacrifice clustering quality; such algorithms include CLARA [11], CLARANS [21], and FastPAM [25]. While these algorithms scale subquadratically in n, they return substantially worse solutions than PAM [28]." when FastPAM1 produce the same results as PAM, and FasterPAM has the same fixpoints as PAM (and hence of the same quality). also they claim "We note that Theorem 1 has been observed in alternate forms, e.g., as the “FastPAM1” trick, in prior work [28]. However, to the best of our knowledge, we are the first to provide a formal statement and proof of Theorem 1", when their 2023 proof is essentially Equations 5 and 6 from our 2021 journal version https://doi.org/10.1016/j.is.2021.101804 that was known to them but they did not cite. Even if that still was not formal enough to them, they should have acknowledged this.

Hence I am a bit reluctant to promote their work by implementing it.