milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.96k stars, 2.95k forks

[Feature]: Support local disk caching and Mmap data #21866

Open xiaofan-luan opened 1 year ago

xiaofan-luan commented 1 year ago

Is there an existing issue for this?

Is your feature request related to a problem? Please describe.

Milvus currently loads the entire vector index into memory to support query/search, which requires a lot of memory and can cause OOM when memory is insufficient.

To improve this, we could redefine load as putting the data on local disk and mmapping it into memory. All in-memory data would then be managed by the operating system's page cache, and users could load datasets larger than memory into Milvus (if memory is sufficient, I would expect performance similar to the current in-memory version).
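As a rough illustration of the idea (not Milvus code), the sketch below writes float32 vectors to a local file, maps the file read-only, and decodes a single vector straight from the mapping. The flat file layout, `DIM`, and the helper names are assumptions made for this example only; the real segment and index formats are different.

```python
import mmap
import os
import struct
import tempfile

DIM = 4              # hypothetical vector dimension
VEC_BYTES = DIM * 4  # each component is a 4-byte float32


def write_vectors(path, vectors):
    """Persist float32 vectors back to back in a flat binary file."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(struct.pack(f"<{DIM}f", *vec))


def open_mapped(path):
    """Map the whole file read-only; the OS faults pages in on first access."""
    f = open(path, "rb")
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return f, mm


def read_vector(mm, i):
    """Decode the i-th vector directly from the mapping, without reading the whole file."""
    return struct.unpack_from(f"<{DIM}f", mm, i * VEC_BYTES)


# "Load" = put data on local disk, then mmap it.
vectors = [[float(i + j) for j in range(DIM)] for i in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
write_vectors(path, vectors)

f, mm = open_mapped(path)
third = read_vector(mm, 3)  # only the page(s) holding this vector need to be resident
mm.close()
f.close()
```

Because the mapping is backed by the page cache, the kernel decides which pages stay resident, which is what would allow a dataset larger than RAM to be "loaded".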

There are a few things we need to investigate before putting this on our schedule:

  1. How do we mmap scalar data, including delta files (maybe keep them all in memory for now)? What is the performance at 1%, 10%, 50%, 90%, and 100% memory/disk ratios?
  2. How do we mmap the vector index, and what is the performance?
  3. What about scalar indexes, such as the marisa trie? An inverted index should be fine to mmap directly, since Lucene does so.
  4. How much longer does it take to load onto disk compared to loading directly into memory?
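A very small harness for the load-time question in item 4 might look like the sketch below. It is an assumption-laden toy: the function names are hypothetical, the file is freshly written so the page cache is already warm, and a meaningful measurement of the memory/disk ratios in item 1 would additionally need cold caches and a memory limit (for example via cgroups).

```python
import mmap
import os
import tempfile
import time


def time_full_read(path):
    """Time reading the entire file into process memory."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        data = f.read()
    return time.perf_counter() - start, len(data)


def time_mmap_touch(path):
    """Time mapping the file and touching one byte per page to fault it in."""
    start = time.perf_counter()
    pages = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), mmap.PAGESIZE):
                _ = mm[offset]  # forces the page to be resident
                pages += 1
    return time.perf_counter() - start, pages


# Hypothetical 8 MiB data file standing in for a segment on local disk.
size = 8 * 1024 * 1024
path = os.path.join(tempfile.mkdtemp(), "segment.bin")
with open(path, "wb") as f:
    f.write(os.urandom(size))

read_seconds, read_bytes = time_full_read(path)
mmap_seconds, touched_pages = time_mmap_touch(path)
print(f"full read: {read_seconds:.4f}s, mmap + touch: {mmap_seconds:.4f}s")
```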

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

yah01 commented 1 year ago

/assign

yah01 commented 1 year ago

IVF_FLAT doesn't support mmap for now, because it stores the original data separately. I will work on it after the C++ segment loader is ready.

yah01 commented 1 year ago

I'm going to support IVF indexes with mmap, as Knowhere has changed the IVF implementation to contain the data part.

xiaofan-luan commented 1 year ago

> I'm going to support IVF indexes with mmap, as Knowhere has changed the IVF implementation to contain the data part.

faiss already supports mmap; are we going to simply enable it? BTW, is there a way to enable mmap only on some fields?

yah01 commented 1 year ago

> > I'm going to support IVF indexes with mmap, as Knowhere has changed the IVF implementation to contain the data part.

> faiss already supports mmap; are we going to simply enable it? BTW, is there a way to enable mmap only on some fields?

I need to dive into the faiss implementation and file format first.

yah01 commented 1 year ago

@cydrain would this index contain the vector data if it was created in an old version?

xiaofan-luan commented 1 year ago

> > > I'm going to support IVF indexes with mmap, as Knowhere has changed the IVF implementation to contain the data part.

> > faiss already supports mmap; are we going to simply enable it? BTW, is there a way to enable mmap only on some fields?

> I need to dive into the faiss implementation and file format first.

We actually have a user who wants to run mmap on a FLAT index.

patelprateek commented 1 year ago

@yah01: does faiss also support adding metadata along with embeddings, or is this only done by Knowhere?

xiaofan-luan commented 1 year ago

> @yah01: does faiss also support adding metadata along with embeddings, or is this only done by Knowhere?

faiss doesn't have a notion of metadata.