microsoft / SPTAG

A distributed approximate nearest neighbor (ANN) search library that provides high-quality vector index build, search, and distributed online serving toolkits for large-scale vector search scenarios.
MIT License
4.79k stars 582 forks

Improve the BalancedDataPartition program #408

suppersam1 closed 9 months ago

suppersam1 commented 9 months ago

My machine has only 64 GB of memory, so I cannot build an index for a 128 GB vector data set. The BalancedDataPartition program might solve this, since it would let me split the 128 GB of data into multiple parts and build them on multiple machines. However, BalancedDataPartition also reads the entire data set into memory to perform clustering, so 64 GB is still not enough. I hope BalancedDataPartition can support reading the data in multiple batches, so that clustering can complete with a small amount of memory. Thank you very much!
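
To make the request concrete, here is a rough sketch of what batched clustering could look like. It is not SPTAG's BalancedDataPartition code and does not balance partition sizes; it is plain online (MacQueen-style) k-means that streams the file in fixed-size batches, so memory use depends on the batch size and the number of centers rather than on the data set size. The file layout (headerless, native-endian float32 rows) and the names `read_batches`, `online_kmeans`, and `assign_partitions` are assumptions made for illustration only.

```python
import random
from array import array


def read_batches(path, dim, batch_size):
    """Yield lists of vectors, at most batch_size vectors at a time.

    Assumed layout (hypothetical): headerless native-endian float32 rows.
    """
    row_bytes = 4 * dim
    with open(path, "rb") as f:
        while True:
            chunk = f.read(row_bytes * batch_size)
            usable = len(chunk) - len(chunk) % row_bytes  # drop a partial row
            if usable == 0:
                break
            flat = array("f")
            flat.frombytes(chunk[:usable])
            yield [flat[i:i + dim] for i in range(0, len(flat), dim)]


def nearest(vec, centers):
    """Return the index of the center closest to vec (squared L2 distance)."""
    best, best_d = 0, float("inf")
    for j, c in enumerate(centers):
        d = sum((a - b) ** 2 for a, b in zip(vec, c))
        if d < best_d:
            best, best_d = j, d
    return best


def online_kmeans(path, dim, k, batch_size=1024, passes=3, seed=0):
    """Estimate k centers while holding only one batch in memory."""
    rng = random.Random(seed)
    gen = read_batches(path, dim, batch_size)
    first = next(gen)  # the first batch must contain at least k vectors
    gen.close()
    centers = [list(v) for v in rng.sample(first, k)]
    counts = [0] * k
    for _ in range(passes):
        for batch in read_batches(path, dim, batch_size):
            for vec in batch:
                j = nearest(vec, centers)
                counts[j] += 1
                step = 1.0 / counts[j]  # running-mean update of center j
                c = centers[j]
                for i in range(dim):
                    c[i] += step * (vec[i] - c[i])
    return centers


def assign_partitions(path, dim, centers, batch_size=1024):
    """Second streaming pass: yield the partition index of every vector."""
    for batch in read_batches(path, dim, batch_size):
        for vec in batch:
            yield nearest(vec, centers)
```

A caller would run `online_kmeans` once to get the centers, then use `assign_partitions` to decide which partition file each vector is written to. A real implementation would still need the size-balancing step that BalancedDataPartition performs, which this sketch leaves out.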