microsoft / SPTAG

A distributed approximate nearest neighborhood search (ANN) library which provides a high quality vector index build, search and distributed online serving toolkits for large scale vector search scenario.
MIT License
4.77k stars 581 forks

KMeans clustering #403

Open ZhenNan2016 opened 9 months ago

ZhenNan2016 commented 9 months ago

I have a few questions about SPANN:

  1. In KMeans clustering, what is the limit for each cluster center? If a cluster exceeds this limit, will it be re-divided into one or multiple layers?
  2. After clustering is completed, when do the centroids in memory need to be updated?
  3. After clustering is completed, should new vector data be written directly into the posting lists on disk, or stored as centroids in memory?
  4. When will KMeans clustering be run again?
  5. If there are too many clusters, will they be clustered again with the KMeans algorithm?
  6. What is the difference between SPTAG and SPTAG++?
  7. A question about hierarchical data partition and partial search: does each query require two steps, 1) distributed dispatch and 2) local search? What are the transactions for these two steps? [image]

Looking forward to your reply. Thanks very much.