nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0
4.31k stars 633 forks source link

auto index resize implementation #503

Closed patelprateek closed 1 year ago

patelprateek commented 1 year ago

I saw this feature has been asked a couple of times and wanted to get your thoughts on possible implementation strategy

(assuming we are ok with the additional runtime complexity overhead it brings due to stalling of writes while copying data )

I saw couple of comments on past issues https://github.com/nmslib/hnswlib/issues/267#issuecomment-744201836 :

resize is not thread safe with insertion (e.g. some other threads, including python ones, need to finish before copying). This might be a part of larger overhaul of synchronization.

https://github.com/nmslib/hnswlib/issues/481#issuecomment-1631687727 :

There is a resize_index method in python which can be called manually.
We can potentially call it automatically once there is not enough space, but it is a O(N) operation which blocks additions for a while an without explicit user call (that also requires an additional global lock, which is a much simpler problem). So we assume that the user takes care of resizes (also doable with a simple python wrapper)

Reading the recent master branch code it seems like resizing can be done at this point https://github.com/nmslib/hnswlib/blob/master/hnswlib/hnswalg.h#L1091

            if (cur_element_count >= max_elements_) {
                // Ensure all threads within `addPoint` are done  (some atomic counter incremented/decremented on entering and exiting 
 addpoint method)
               // call void resizeIndex(size_t new_max_elements) 
            }

The part i am not clear about is why would we need additional global lock ? seems like label_lookup_lock already acts as a global synchronization .