Open andersonbcdefg opened 9 months ago
@andersonbcdefg Can I try your dataset?
Great, @andersonbcdefg I downloaded this data file . What model/service did you use to vectorize the text? Can you share your vectorization technique/code? That way i can try to reproduce exactly what you are seeing.
Thanks for the issue, @andersonbcdefg! Current version has very high variance depending on the dataset and other function arguments. I will be releasing a different algorithm in v3 😉 Will you be open to help test it before the public release?
i'm down! i did just rip all the usearch clustering out of my codebase and replace it with K-means, haha. but i'll test it outside of prod on similar datasets and see if it's faster, and if so i can put it back! :D I do think the resulting K-means clusters are a bit worse, but it's 20 seconds for streaming k-means vs 10 minutes for usearch so that was a pretty major difference.
@sourcesync I didn't do anything crazy just compute dense embeddings with an open-source model like BGE-small over the "conversations" field converted to text by unrolling the conversation with "user:" and "assistant:" prefixes
@sourcesync I didn't do anything crazy just compute dense embeddings with an open-source model like BGE-small over the "conversations" field converted to text by unrolling the conversation with "user:" and "assistant:" prefixes
Thanks @andersonbcdefg. Per Ash's comment, I was assuming there is something going on with the statistical distribution of your vectors giving you worst-case performance. When I get a chance, I'll try a different vectorizer using the same fields you used.
Describe the bug
Not sure if it's a bug, but the Usearch README led me to expect near-real-time clustering even for large indexes. However, I'm finding that at the 1M point scale,
index.cluster
takes 500+ seconds.Steps to reproduce
Embed a large dataset of ~1M points and insert into usearch index. (I used 384-dim vectors.)
Expected behavior
Clustering takes seconds or a few minutes.
USearch version
v2.8.15
Operating System
Debian 11
Hardware architecture
x86
Which interface are you using?
Python bindings
Contact Details
andersonbcdefg@gmail.com
Is there an existing issue for this?
Code of Conduct