tantaraio / voy

🕸️🦀 A WASM vector similarity search written in Rust
https://www.npmjs.com/package/voy-search
Apache License 2.0
867 stars 31 forks source link

Why use kd-tree rather than HNSW? #23

Open zxch3n opened 1 year ago

zxch3n commented 1 year ago

For high-dimensional data, kd-tree uses O(n) time to do the search. And knowing something about the (Euclidian) distance in one dimension says very little about the distance in the full space.

jlarmstrongiv commented 1 year ago

I explored HNSW a couple months ago. There are currently two HNSW wasm packages:

DawChihLiou commented 1 year ago

Thanks for opening the issue @zxch3n ! Could you tell me more about HNSW and why is it a better option than kd tree? I don't know much about HNSW yet.

DawChihLiou commented 1 year ago

I explored HNSW a couple months ago. There are currently two HNSW wasm packages:

@jlarmstrongiv That's awesome! What're your thoughts on HNSW?

jlarmstrongiv commented 1 year ago

@DawChihLiou I thought it was really neat! Both packages are open-source on github and up to date with hnswlib v0.7.0

Importantly:

Has full support for incremental index construction and updating the elements. Has support for element deletions (by marking them in index). Index is picklable.

The only thing I couldn’t figure out with the wasm version yet was importing/exporting the index to/from a file in the browser. Notes to self:

zxch3n commented 1 year ago

You can see the details about HNSW in its paper https://arxiv.org/abs/1603.09320

DawChihLiou commented 1 year ago

@zxch3n thanks for the suggestion. I'll explore HNSW once Voy is stabilized so we'll be able to compare the benchmarks.

A few more resources:

jlarmstrongiv commented 1 year ago

For high-dimensional data, kd-tree uses O(n) time to do the search. And knowing something about the (Euclidian) distance in one dimension says very little about the distance in the full space.

Ahh, that answers the question I had when reading the kiddo readme and why their demo and benchmarks only had up to 4D:

Kiddo is ideal for super-fast spatial / geospatial lookups and nearest-neighbour / KNN queries for low-ish numbers of dimensions

The example from https://github.com/tantaraio/voy/issues/21 has a higher dimensionality of 384, not just 4. Unfortunately, most of my data has a much, much, much higher dimensionality than that. It seems like HNSW is a better fit overall.

What are your thoughts on that @DawChihLiou, since most of the embeddings you have planned have high dimensionality?

DawChihLiou commented 1 year ago

I think HNSW is a better fit and worth exploring too. Currently kiddo is capable of handling higher dimensionality and produce quality results. The vectors in the example are in 768d. I'll work on HNSW once Voy's API is more stabilized.

jlarmstrongiv commented 1 year ago

I'll work on HNSW once Voy's API is more stabilized

I think HNSW may change some of Voy’s APIs, like the type of SerializedIndex among others. I wish I could contribute to this issue, but I’m much more familiar with React and TypeScript

DawChihLiou commented 1 year ago

@jlarmstrongiv thanks for your support! I really appreciate it. HNSW will be an internal implementation so it'll most likely to have no effect on the API. SerializedIndex is done to communicate between js and wasm.

yeus commented 4 months ago

hey, I have experimented a little bit with hnsw based vector stores in the browser, check out this link:

https://github.com/xyntopia/vexvault

you can find the relevant part here: https://github.com/Xyntopia/vexvault/blob/vexvault/src/modules/localVectorStore.ts