Open skeptrunedev opened 8 months ago
There aren't any public issues yet (you're the first person to leave one!). If you're interested in clustering larger-than-memory datasets, we actually already support minibatch k-means, which iteratively clusters based on chunks of data. I have no idea how good a job it does or whether it solves your problem, so it would be awesome if you wanted to give it a try and let us know how it goes.
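For anyone landing here, the chunk-at-a-time idea can be sketched with scikit-learn's `MiniBatchKMeans.partial_fit` as a stand-in; the synthetic data and names below are illustrative only, not this project's API:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

def vector_chunks(n_chunks=10, chunk_size=1000, dim=8):
    # Stand-in for a larger-than-memory dataset: yields one chunk at a time
    # so only a single chunk is ever resident in memory.
    for _ in range(n_chunks):
        yield rng.normal(size=(chunk_size, dim)).astype(np.float32)

km = MiniBatchKMeans(n_clusters=4, random_state=0)
for chunk in vector_chunks():
    km.partial_fit(chunk)  # update centroids incrementally per chunk

labels = km.predict(next(vector_chunks()))
```

Each `partial_fit` call nudges the current centroids toward the new chunk, so memory use stays bounded by the chunk size rather than the full dataset.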
Yeah, actually, I am very game. I'll likely run that sometime this week. I think I have to write a bit of code on top to work with the Qdrant scroll endpoint.
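A rough sketch of that glue code: page vectors out of Qdrant and yield them as chunks. The `scroll` callable, page size, and point shape here are assumptions modeled on qdrant-client's `scroll`, which returns a `(points, next_offset)` pair:

```python
import numpy as np

def iter_vector_pages(scroll, page_size=1024):
    """Yield numpy arrays of vectors, one page per scroll call.

    `scroll(offset, limit)` must return (points, next_offset), where each
    point exposes a .vector attribute -- mirroring qdrant-client's scroll.
    """
    offset = None
    while True:
        points, offset = scroll(offset, page_size)
        if not points:
            break
        yield np.asarray([p.vector for p in points], dtype=np.float32)
        if offset is None:  # Qdrant signals the last page with a None offset
            break
```

With a real client this would presumably be wired up as something like `scroll = lambda offset, limit: client.scroll("my_collection", limit=limit, offset=offset, with_vectors=True)` (collection name hypothetical), and each yielded page fed straight into the minibatch clusterer.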
This project generally seems quite helpful. Honestly, I'm most interested in the clustering; we are fairly happy with our deduplication system as-is. It seems like, as things stand, you need enough memory to hold all your vectors at once before you can run the algorithm.
Most of our customer vector datasets are >80GB in size, so we would need some way to cluster them in a paginated fashion. It would be cool to contribute that, but I wanted to see if there was maybe already an issue for it or something adjacent?