jbellis / jvector

JVector: the most advanced embedded vector search engine
Apache License 2.0

How to build a very large index in one pass on an old, low-spec machine #166

Closed xjtushilei closed 10 months ago

xjtushilei commented 11 months ago

M64-32ms: a virtual machine with dual Xeon E7-8890v3s (32-vCPUs) with 1792GB DDR3 RAM that we use to build a one-shot in-memory index for billion point datasets.

Although DiskANN does not need much memory at query time, building the index consumes a lot of it. As the quote above shows, Microsoft used a machine with nearly 1.8 TB of memory to build an index over 1 billion vectors in one shot, even though the vectors have only 128 dimensions.
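To make the scale concrete, here is a rough back-of-the-envelope sketch of the memory needed just to hold the raw vectors and a fixed-degree adjacency list. The component width (4-byte floats), the maximum degree of 64, and the 4-byte neighbor ids are assumptions chosen for illustration, not figures from the paper, and the class and method names are hypothetical. The estimate ignores allocator overhead and build-time working buffers, so it should be read as a lower bound rather than an explanation of the full 1.8 TB.

```java
// Rough lower-bound estimate of in-memory index size.
// Assumptions (not from the paper): float32 components, max degree 64, int32 neighbor ids.
final class IndexMemoryEstimate {

    // Bytes needed to store every vector component contiguously.
    static long rawVectorBytes(long numVectors, int dimensions, int bytesPerComponent) {
        return Math.multiplyExact(Math.multiplyExact(numVectors, (long) dimensions),
                                  (long) bytesPerComponent);
    }

    // Bytes needed for a dense adjacency list with a fixed maximum degree per node.
    static long adjacencyBytes(long numVectors, int maxDegree, int bytesPerNeighborId) {
        return Math.multiplyExact(Math.multiplyExact(numVectors, (long) maxDegree),
                                  (long) bytesPerNeighborId);
    }

    public static void main(String[] args) {
        long n = 1_000_000_000L;                              // 1 billion vectors
        long vectors = rawVectorBytes(n, 128, Float.BYTES);   // 128 dimensions
        long graph = adjacencyBytes(n, 64, Integer.BYTES);    // assumed degree 64
        long gb = 1_000_000_000L;
        // prints "vectors: 512 GB, graph: 256 GB, total: 768 GB"
        System.out.printf("vectors: %d GB, graph: %d GB, total: %d GB%n",
                vectors / gb, graph / gb, (vectors + graph) / gb);
    }
}
```

Even under these modest assumptions the raw data alone is far beyond what a typical build machine offers, which is why a partitioned build is attractive.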

Smaller Vamana indices for overlapping partitions of a large dataset can be easily merged into one index that provides nearly the same search performance as a single-shot index constructed for the entire dataset. This allows indexing of datasets that are otherwise too large to fit in memory.

As the quote above shows, the DiskANN paper describes how to construct one large Vamana index by merging several smaller Vamana indexes.
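A minimal sketch of what such a merge could look like, assuming each partition's graph is keyed by global node ids and the merge is a plain union of every node's neighbor lists across partitions. This is not JVector's API; the class and method names are hypothetical, and a real implementation would also need to cap or prune the merged neighbor lists.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration: merge per-partition graphs by taking the union of edges.
final class PartitionGraphMerge {

    // Each input map goes from a global node id to that node's neighbors within one partition.
    static Map<Integer, SortedSet<Integer>> mergeByEdgeUnion(
            List<Map<Integer, List<Integer>>> partitionGraphs) {
        Map<Integer, SortedSet<Integer>> merged = new TreeMap<>();
        for (Map<Integer, List<Integer>> graph : partitionGraphs) {
            for (Map.Entry<Integer, List<Integer>> entry : graph.entrySet()) {
                // A node that appears in several overlapping partitions accumulates all its edges.
                merged.computeIfAbsent(entry.getKey(), k -> new TreeSet<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Node 2 belongs to both partitions, so it links the two sub-graphs after the merge.
        Map<Integer, List<Integer>> a = Map.of(0, List.of(1, 2), 1, List.of(0), 2, List.of(0));
        Map<Integer, List<Integer>> b = Map.of(2, List.of(3), 3, List.of(2, 4), 4, List.of(3));
        // prints {0=[1, 2], 1=[0], 2=[0, 3], 3=[2, 4], 4=[3]}
        System.out.println(mergeByEdgeUnion(List.of(a, b)));
    }
}
```

The overlap between partitions is what keeps the merged graph connected: without shared nodes such as node 2 above, the union would be a set of disjoint sub-graphs.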

Does JVector have any plans to support this?

jbellis commented 11 months ago

Do you have a use case where you need to build larger-than-memory indexes or are you just saying this would be cool?

jbellis commented 11 months ago

I wrote up what I think would be the "right" way to add larger than memory construction: https://github.com/jbellis/jvector/issues/168

xjtushilei commented 11 months ago

Do you have a use case where you need to build larger-than-memory indexes or are you just saying this would be cool?

I need to build many indexes over relatively large datasets, but memory is limited. Right now we are forced to use many large-memory machines, and once the build is complete those machines are no longer needed.

I wrote up what I think would be the "right" way to add larger than memory construction: #168

I saw it, thanks for the clarification.