nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0

why do we need to declare the maximum number of elements the index can store beforehand? #481

Open siddhsql opened 1 year ago

siddhsql commented 1 year ago

from the docs:

# Initializing index - the maximum number of elements should be known beforehand
p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

why do we need to declare the maximum number of elements the index can store beforehand? I don't see any such constraint in the algorithm described in the HNSW paper. Thanks.

yurymalkov commented 1 year ago

This is due to the flat table design of the data/link storage. There is a resize_index method in Python which can be called manually. We could potentially call it automatically once there is not enough space, but it is an O(N) operation that blocks additions for a while, and triggering it without an explicit user call also requires an additional global lock (an explicit call is a much simpler problem). So we assume that the user takes care of resizes (also doable with a simple python wrapper).
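For illustration, a minimal sketch of handling the resize manually from the C++ API (assuming the usual HierarchicalNSW constructor arguments; resize_index is the Python counterpart of resizeIndex):

#include "hnswlib/hnswlib.h"

int main() {
    int dim = 16;
    size_t max_elements = 1000;

    hnswlib::L2Space space(dim);
    hnswlib::HierarchicalNSW<float> index(&space, max_elements, /*M=*/16, /*ef_construction=*/200);

    // ... add up to max_elements points with index.addPoint(data, label) ...

    // Grow the flat tables explicitly. This is the O(N) reallocation mentioned
    // above, but it happens at a moment the caller chooses instead of blocking
    // a random insert.
    index.resizeIndex(2 * max_elements);
    return 0;
}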

siddhsql commented 1 year ago

thanks for the response. I have noticed this parameter is not required in FAISS's implementation of HNSW (see this for example). Could you explain why?

yurymalkov commented 1 year ago

I think faiss is fine with O(N) time complexity during insertion. In the code they use std::vector to store the data, and pushes to it can lead to O(N) complexity on rare occasions, which I guess is not an issue if you don't do many small inserts (say, a single huge batch insert, which already has even higher complexity). hnswlib, on the other hand, is optimized for the ability to do small searches and small insertions with low latency.
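A small self-contained illustration of the std::vector behavior referred to here: push_back is only amortized O(1), and each reallocation copies all existing elements.

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v;
    size_t last_capacity = 0;
    for (int i = 0; i < 100000; ++i) {
        v.push_back(i);
        if (v.capacity() != last_capacity) {
            // Each capacity change is a reallocation that copies all existing
            // elements: the rare O(N) insertion mentioned above.
            std::printf("size=%zu capacity=%zu (reallocated)\n", v.size(), v.capacity());
            last_capacity = v.capacity();
        }
    }
    return 0;
}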

siddhsql commented 1 year ago

Thanks Yury. I looked at the code and tried to understand it:

data_level0_memory_ = (char *) malloc(max_elements_ * size_data_per_element_);

is this also the reason why we cannot delete vectors from the index? If we were using linked lists instead of a static array, would these two limitations go away?

  1. have to declare in advance the number of vectors we will be storing in the index
  2. cannot delete vectors from the index (by delete I mean a real delete that frees up memory).

I see that the FAISS implementation does not have limitation 1, but it does have limitation 2. Any idea why?

Did you do any perf tests comparing the impact of using linked lists instead of the static memory block? Curious to see if you have any results for that.

yurymalkov commented 1 year ago

The problem with simple removal of elements is that the graph structure deteriorates over time. I discuss it a bit in https://www.youtube.com/watch?v=4PsyNdFlxmk&t=2247s. hnswlib does support deletions, though the index RAM consumption does not shrink (it does not grow either).
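For reference, a minimal C++ sketch of such a deletion (mark_deleted in Python, markDelete in the C++ header): the element is hidden from searches, but its slot in the flat tables stays allocated, which is why RAM usage does not shrink.

#include "hnswlib/hnswlib.h"
#include <vector>

int main() {
    int dim = 4;
    hnswlib::L2Space space(dim);
    hnswlib::HierarchicalNSW<float> index(&space, /*max_elements=*/100);

    std::vector<float> a(dim, 0.0f), b(dim, 1.0f);
    index.addPoint(a.data(), /*label=*/0);
    index.addPoint(b.data(), /*label=*/1);

    index.markDelete(0);  // soft delete: label 0 is now skipped by searches

    // Only label 1 can be returned; the memory for label 0 is still held.
    auto result = index.searchKnn(a.data(), 1);
    return 0;
}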

siddhsql commented 1 year ago

thanks Yury. Another question I have is about this code:

template <bool has_deletions, bool collect_metrics = false> 
    std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst>
    searchBaseLayerST(tableint ep_id, const void *data_point, size_t ef, BaseFilterFunctor* isIdAllowed = nullptr) const {

could you explain why you are using a template for the bool parameters? Wouldn't it be better to write this function as:

std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst>
    searchBaseLayerST(tableint ep_id, const void *data_point, size_t ef,
                      bool has_deletions, bool collect_metrics, BaseFilterFunctor* isIdAllowed = nullptr) const {

I don't know much about C++, so just asking.

yurymalkov commented 1 year ago

Templates are faster because they allow skipping the boolean checks in critical loops by removing them at compile time (collect_metrics=true and collect_metrics=false lead to different binary code).
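Not hnswlib's actual code, just a toy sketch of the pattern: with a template parameter the flag is a compile-time constant, so the compiler emits two specialized functions and strips the dead branch from each, whereas a runtime bool is re-checked on every iteration of the hot loop.

#include <cstddef>

// Runtime flag: the check is evaluated on every iteration.
size_t count_positive_runtime(const float* data, size_t n, bool skip_zeros) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (skip_zeros && data[i] == 0.0f) continue;  // tested n times
        if (data[i] > 0.0f) ++count;
    }
    return count;
}

// Template flag: skip_zeros is known at compile time, so each instantiation
// (count_positive_template<true>, count_positive_template<false>) is compiled
// with the branch either always taken or removed entirely.
template <bool skip_zeros>
size_t count_positive_template(const float* data, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (skip_zeros && data[i] == 0.0f) continue;  // folded at compile time
        if (data[i] > 0.0f) ++count;
    }
    return count;
}

The call sites then pick the specialization, which mirrors how the caller chooses between searchBaseLayerST<true> and searchBaseLayerST<false> depending on whether the index currently contains deleted elements.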