unum-cloud / usearch

Fast Open-Source Search & Clustering engine • for Vectors & 🔜 Strings • in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
https://unum-cloud.github.io/usearch/
Apache License 2.0

Bug: Clustering is really really slow #347

Open andersonbcdefg opened 9 months ago

andersonbcdefg commented 9 months ago

Describe the bug

Not sure if this is a bug, but the USearch README led me to expect near-real-time clustering even for large indexes. However, I'm finding that at the 1M-point scale, index.cluster takes 500+ seconds.

Steps to reproduce

Embed a large dataset of ~1M points, insert them into a USearch index, and call index.cluster. (I used 384-dim vectors.)

Expected behavior

Clustering takes seconds or a few minutes.

USearch version

v2.8.15

Operating System

Debian 11

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

andersonbcdefg@gmail.com


sourcesync commented 9 months ago

@andersonbcdefg Can I try your dataset?

andersonbcdefg commented 9 months ago

https://huggingface.co/datasets/teknium/OpenHermes-2.5

sourcesync commented 9 months ago

Great, @andersonbcdefg, I downloaded this data file. What model/service did you use to vectorize the text? Can you share your vectorization technique/code? That way I can try to reproduce exactly what you are seeing.

ashvardanian commented 9 months ago

Thanks for the issue, @andersonbcdefg! The current version's runtime has very high variance depending on the dataset and other function arguments. I will be releasing a different algorithm in v3 😉 Would you be open to helping test it before the public release?

andersonbcdefg commented 9 months ago

I'm down! I did just rip all the usearch clustering out of my codebase and replace it with k-means, haha. But I'll test it outside of prod on similar datasets and see if it's faster, and if so I can put it back! :D I do think the resulting k-means clusters are a bit worse, but it's 20 seconds for streaming k-means vs. 10 minutes for usearch, so that was a pretty major difference.
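For reference, the streaming k-means being compared against can be sketched roughly like this: a minimal mini-batch k-means in NumPy, where the function name and all parameters are illustrative, not the commenter's actual code:

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=1024, n_iters=100, seed=0):
    """Minimal mini-batch (streaming) k-means sketch."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k random points.
    centroids = X[rng.choice(len(X), k, replace=False)].copy()
    counts = np.zeros(k)  # per-centroid update counts, used as a decaying learning rate
    for _ in range(n_iters):
        batch = X[rng.choice(len(X), batch_size, replace=False)]
        # Assign each batch point to its nearest centroid (squared Euclidean).
        dists = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Nudge each assigned centroid toward its point; the step shrinks as counts grow.
        for x, c in zip(batch, labels):
            counts[c] += 1
            centroids[c] += (x - centroids[c]) / counts[c]
    return centroids
```

Each iteration touches only one batch rather than the full dataset, which is why this scales roughly with `n_iters * batch_size` instead of the index size.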

andersonbcdefg commented 9 months ago

@sourcesync I didn't do anything crazy, just computed dense embeddings with an open-source model like BGE-small over the "conversations" field, converted to text by unrolling the conversation with "user:" and "assistant:" prefixes.
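The unrolling step described above might look roughly like this. This is a guess at the preprocessing, not the commenter's actual code, and it assumes the OpenHermes-2.5 schema where each conversation is a list of turns with "from" and "value" keys:

```python
def unroll_conversation(conversation):
    """Flatten a list of chat turns into one string with role prefixes."""
    # Mapping of schema role names to the prefixes described in the comment;
    # the "system:" entry and unknown-role fallback are assumptions.
    role_prefix = {"human": "user:", "gpt": "assistant:", "system": "system:"}
    lines = []
    for turn in conversation:
        prefix = role_prefix.get(turn["from"], turn["from"] + ":")
        lines.append(f"{prefix} {turn['value']}")
    return "\n".join(lines)
```

The resulting string would then be fed to the embedding model as a single document.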

sourcesync commented 9 months ago

> @sourcesync I didn't do anything crazy, just computed dense embeddings with an open-source model like BGE-small over the "conversations" field, converted to text by unrolling the conversation with "user:" and "assistant:" prefixes.

Thanks @andersonbcdefg. Per Ash's comment, I was assuming there is something about the statistical distribution of your vectors that triggers worst-case performance. When I get a chance, I'll try a different vectorizer over the same fields you used.