Open siddhsql opened 1 year ago
This is due to the flat-table design of the data/link storage. There is a resize_index
method in Python which can be called manually.
We could potentially call it automatically once there is not enough space, but it is an O(N) operation which blocks additions for a while, and without an explicit user call it also requires an additional global lock (with an explicit call it is a much simpler problem). So we assume that the user takes care of resizes (also doable with a simple Python wrapper).
Thanks for the response. I have noticed this parameter is not required in FAISS's implementation of HNSW (see this for example). Could you explain why?
I think faiss is fine with O(N) time complexity during insertion. In the code they use std::vector to store the array; pushes into it can lead to O(N) complexity on rare occurrences, which I guess is not an issue if you don't do many small inserts (say, a single huge batch insert, which already has even higher complexity).
hnswlib, on the other hand, is optimized for the ability to serve small searches and small insertions with low latency.
Thanks, Yury. I looked at the code and tried to understand it.
data_level0_memory_ = (char *) malloc(max_elements_ * size_data_per_element_);
Is this also the reason why we cannot delete vectors from the index? If we were using linked lists instead of a static array, would these two limitations go away?
I see that the FAISS implementation does not require 1. but it does have the 2nd limitation. Any idea why?
Did you do any perf tests comparing the perf impact of using linked lists instead of the static memory block? Curious to see if you have any results.
The problem with simple removal of elements is that the graph structure deteriorates over time. I discuss it a bit in https://www.youtube.com/watch?v=4PsyNdFlxmk&t=2247s
hnswlib does support deletions, though the index RAM consumption does not shrink in size (but does not grow either).
Thanks, Yury. Another question I have is here:
template <bool has_deletions, bool collect_metrics = false>
std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst>
searchBaseLayerST(tableint ep_id, const void *data_point, size_t ef, BaseFilterFunctor* isIdAllowed = nullptr) const {
Could you explain why you are using a template for the bools? Wouldn't it be better to write this function as:
std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst>
searchBaseLayerST(tableint ep_id, const void *data_point, size_t ef,
bool has_deletions, bool collect_metrics, BaseFilterFunctor* isIdAllowed = nullptr) const {
I don't know much about C++, so I'm asking.
Templates are faster because they allow skipping checks in critical loops by removing them at compile time (collect_metrics=true and collect_metrics=false lead to different binary code).
from the docs:
Why do we need to declare the maximum number of elements the index can store beforehand? I don't see any constraint like this in the algorithm described in the HNSW paper. Thanks.