taylorai / galactic

data cleaning and curation for unstructured text
Apache License 2.0

question: I am interested in contributing, is there a public issue backlog? #18

Open skeptrunedev opened 8 months ago

skeptrunedev commented 8 months ago

This project generally seems quite helpful. Honestly, I'm most interested in the clustering; we are fairly happy with our deduplication system as is. It seems like, for this to work as is, you need enough memory to hold all your vectors at once, and from there you can run the algorithm.

Most of our customer vector datasets are >80GB in size, so we would need some way to cluster them in a paginated fashion. It would be cool to contribute that, but I wanted to see if there was maybe already an issue for it or something adjacent?

andersonbcdefg commented 8 months ago

There aren't any public issues (you're the first person to open one!). If you're interested in clustering datasets larger than memory, we actually already support minibatch k-means, which iteratively clusters chunks of the data. I have no idea how well that works or whether it solves your problem, though, so it would be awesome if you wanted to give it a try and let us know how it goes.
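For reference, the core idea is that minibatch k-means only ever needs one chunk of vectors in memory at a time: each chunk nudges the centroids, then gets discarded. Here's a minimal sketch of that pattern using scikit-learn's MiniBatchKMeans and partial_fit; galactic's own implementation may differ in its details, and the chunk source here is a stand-in:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Larger-than-memory clustering: update centroids one chunk at a time,
# so only a single chunk of vectors is ever held in memory.
kmeans = MiniBatchKMeans(n_clusters=10, batch_size=1024)

def iter_chunks():
    # Placeholder: yield (n, d) numpy arrays loaded from disk, a DB, etc.
    for _ in range(100):
        yield np.random.rand(1024, 384)

# First streaming pass: incrementally fit the centroids.
for chunk in iter_chunks():
    kmeans.partial_fit(chunk)

# Second streaming pass: assign cluster labels chunk by chunk.
labels = [kmeans.predict(chunk) for chunk in iter_chunks()]
```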

skeptrunedev commented 8 months ago

> There aren't any public issues (you're the first person to open one!). If you're interested in clustering datasets larger than memory, we actually already support minibatch k-means, which iteratively clusters chunks of the data. I have no idea how well that works or whether it solves your problem, though, so it would be awesome if you wanted to give it a try and let us know how it goes.

Yeah actually, I am very game. I will likely run that sometime this week. I think I have to write a bit of code on top to work with the qdrant scroll endpoint.
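Something along these lines is what I have in mind: page through the collection with qdrant's scroll endpoint and feed each page into a minibatch pass. This is just a sketch, and the collection name, cluster count, and page size are made up for illustration:

```python
import numpy as np
from qdrant_client import QdrantClient
from sklearn.cluster import MiniBatchKMeans

client = QdrantClient("localhost", port=6333)
kmeans = MiniBatchKMeans(n_clusters=50, batch_size=4096)

offset = None
while True:
    # scroll() pages through points; the returned offset is None
    # once the collection is exhausted.
    points, offset = client.scroll(
        collection_name="customer_vectors",  # hypothetical collection
        limit=4096,
        offset=offset,
        with_payload=False,
        with_vectors=True,
    )
    if not points:
        break
    # Assumes a single unnamed vector per point.
    kmeans.partial_fit(np.array([p.vector for p in points]))
    if offset is None:
        break
```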