Open zxch3n opened 1 year ago
I explored HNSW a couple months ago. There are currently two HNSW wasm packages:
Thanks for opening the issue @zxch3n ! Could you tell me more about HNSW and why is it a better option than kd tree? I don't know much about HNSW yet.
I explored HNSW a couple months ago. There are currently two HNSW wasm packages:
@jlarmstrongiv That's awesome! What're your thoughts on HNSW?
@DawChihLiou I thought it was really neat! Both packages are open-source on github and up to date with hnswlib v0.7.0
Importantly:
Has full support for incremental index construction and updating the elements. Has support for element deletions (by marking them in index). Index is picklable.
The only thing I couldn’t figure out with the wasm version yet was importing/exporting the index to/from a file in the browser. Notes to self:
it looks like it also supports node? https://github.com/ShravanSunder/hnswlib-wasm/blob/master/src/loadHnswlib.ts#L6-L8
via https://emscripten.org/docs/api_reference/Filesystem-API.html
seems you will need to write something like this: https://stackoverflow.com/a/61580291 basically, to load the nodejs file in the browser: get the node-created file contents as uint8array, mount idbfs, open a idbfs file, write the uint8array to it, then call readIndex on that file:
see emscripten api docs https://emscripten.org/docs/api_reference/Filesystem-API.html#including-file-system-support
this might also be useful https://emscripten.org/docs/api_reference/Filesystem-API.html#FS.createPreloadedFile
you choose what fs to use https://github.com/ShravanSunder/hnswlib-wasm/blob/master/src/loadHnswlib.ts#L30
You can see the details about HNSW in its paper https://arxiv.org/abs/1603.09320
@zxch3n thanks for the suggestion. I'll explore HNSW once Voy is stabilized so we'll be able to compare the benchmarks.
A few more resources:
For high-dimensional data, kd-tree uses O(n) time to do the search. And knowing something about the (Euclidian) distance in one dimension says very little about the distance in the full space.
Ahh, that answers the question I had when reading the kiddo
readme and why their demo and benchmarks only had up to 4D:
Kiddo is ideal for super-fast spatial / geospatial lookups and nearest-neighbour / KNN queries for low-ish numbers of dimensions
The example from https://github.com/tantaraio/voy/issues/21 has a higher dimensionality of 384, not just 4. Unfortunately, most of my data has a much, much, much higher dimensionality than that. It seems like HNSW
is a better fit overall.
What are your thoughts on that @DawChihLiou, since most of the embeddings you have planned have high dimensionality?
I think HNSW is a better fit and worth exploring too. Currently kiddo is capable of handling higher dimensionality and produce quality results. The vectors in the example are in 768d. I'll work on HNSW once Voy's API is more stabilized.
I'll work on HNSW once Voy's API is more stabilized
I think HNSW may change some of Voy’s APIs, like the type of SerializedIndex
among others. I wish I could contribute to this issue, but I’m much more familiar with React and TypeScript
@jlarmstrongiv thanks for your support! I really appreciate it. HNSW will be an internal implementation so it'll most likely to have no effect on the API. SerializedIndex is done to communicate between js and wasm.
hey, I have experimented a little bit with hnsw based vector stores in the browser, check out this link:
https://github.com/xyntopia/vexvault
you can find the relevant part here: https://github.com/Xyntopia/vexvault/blob/vexvault/src/modules/localVectorStore.ts
For high-dimensional data, kd-tree uses O(n) time to do the search. And knowing something about the (Euclidian) distance in one dimension says very little about the distance in the full space.