nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0

do not copy vectors and support mmap files #259

Open louisabraham opened 3 years ago

louisabraham commented 3 years ago

I see huge potential for improvement at a very low cost. Currently, indexes are well optimized in size (and one can control it with parameters), but we still have to copy all the data.

My proposal has two steps:

  1. Add an option to avoid copying data when adding it to the index: instead, we just store a reference / pointer. This should be an experimental feature because it makes the user responsible for memory management. It can be tricky to interface with C++, but numpy works well IMO. The gain would be a factor of 2 in memory for free (currently it's still possible to add items in batches to limit duplication).
  2. Support numpy's memmap / POSIX's mmap. This is where the real gains come in, because it allows one to store only the index in RAM and keep the much larger data on disk. The index could also be on disk, but that might degrade performance too much (I don't recall the paper well enough to be confident). This makes queries slower but allows much larger databases, so it's a time/memory trade-off that can come in handy in many cases (large databases and limited RAM with non-critical access time). See the sketch below.
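
To make step 2 concrete, here is a minimal sketch of the C++ side, assuming the vectors are stored row-major as float32 in a plain binary file; map_vectors is an illustrative helper, not part of hnswlib:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a binary file of float32 vectors (n x dim, row-major) into memory.
// With step 1 in place, the index could keep pointers into this mapping
// instead of copying every vector; the OS pages data in from disk on demand.
float* map_vectors(const char* path, size_t n, size_t dim) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    size_t bytes = n * dim * sizeof(float);
    void* ptr = mmap(nullptr, bytes, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return ptr == MAP_FAILED ? nullptr : static_cast<float*>(ptr);
}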

I can help implement this, but I first need to know:

searchivarius commented 3 years ago

Let me chime in on this, though it's Yura's package. It seems to me that copying data and even indexing it has a negligible cost nowadays compared to the effort spent on generating these vectors.

louisabraham commented 3 years ago

That's good news, since it means the performance won't be affected too much. Note that my focus is really on decreasing RAM usage: HNSW is my go-to ANN package, but it lacks an option to index large datasets on a machine with limited RAM, like the one Annoy offers.

searchivarius commented 3 years ago

But Annoy would still be slow if the index doesn't fit into RAM?

louisabraham commented 3 years ago

It's acceptable thanks to mmap, especially on an SSD.

I'm just curious how hard it would be to do in HNSW and how that would compare in terms of speed.

yurymalkov commented 3 years ago

Hi @louisabraham

  1. One thing that comes to mind is that accessing the python data from hnswlib seems to require acquiring the GIL, which makes the code effectively single-threaded. We can store references to the C array memory, but I am not sure that will work with numpy mmap (in which case it is not as useful for this scenario). If those problems can be solved, it would be an interesting addition, and it could probably be done as a separate space in the C++ library (plus some changes in the bindings).
  2. It seems interesting and worth adding. Probably the easiest option is to use unix mmap instead of malloc for the lower layer (there was PR #227 which already looked at this, but it had many changes, including to the API; maybe we can reuse it), though it will not work on Windows without additional effort. A rough sketch of the idea follows below.
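
For point 2, here is what swapping malloc for a file-backed unix mmap could look like for the level-0 slab. This is a sketch with minimal error handling; mmap_level0 is an illustrative name, not an existing function in hnswlib:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Back the level-0 storage (links + data + labels) with a file instead of
// heap memory, so it is paged in from disk on demand and can exceed RAM.
char* mmap_level0(const char* path, size_t max_elements, size_t size_per_element) {
    size_t bytes = max_elements * size_per_element;
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, (off_t)bytes) != 0) { close(fd); return nullptr; }  // size the backing file
    void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return mem == MAP_FAILED ? nullptr : static_cast<char*>(mem);
}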

Concerning what to store: for each read of the data there is a corresponding read of the link list during the search, and in addition they can be collocated (in case the mmap is done within the index), so putting them together would probably not slow down the index that much. Something similar (storage of both links and data in the lower level) was studied in https://proceedings.neurips.cc//paper/2020/file/788d986905533aba051261497ecffcbb-Paper.pdf (and overall IMO it makes sense to borrow insights from that paper).

dbespalov commented 3 years ago

@louisabraham @yurymalkov

I think both of these features would be extremely useful!!

I have recently added pickle support to the python bindings, so the internal data storage of the HierarchicalNSW (HNSW) class, implemented in hnswalg.h, is fresh in my head. Maybe I can offer some insights here as well.

Data points are stored in the data_level0_memory_ array, defined on line 120. The points are stored together with their external labels and the links for the first layer (level 0). They are copied into data_level0_memory_ by the HNSW methods addPoint and updatePoint on lines 1030 and 833. Access to points in data_level0_memory_ is encapsulated by the inline method getDataByInternalId, defined on line 151.
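
Roughly, the per-element layout and the accessor look like this (paraphrased from hnswalg.h, with comments added):

// Each element occupies size_data_per_element_ bytes in data_level0_memory_,
// laid out as [level-0 links][raw vector][external label].
inline char *getDataByInternalId(tableint internal_id) const {
    // offsetData_ skips past the level-0 link list to reach the vector
    return data_level0_memory_ + internal_id * size_data_per_element_ + offsetData_;
}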

One option could be to store point IDs in data_level0_memory_ instead of the actual data points. These IDs would then be used by getDataByInternalId to load points from some external table. In this case, the signatures of addPoint and updatePoint would have to change from

void addPoint(const void *data_point, labeltype label);
void updatePoint(const void *dataPoint, tableint internalId, float updateNeighborProbability);

to something like

void addPoint(pointtype point_id, labeltype label);
void updatePoint(pointtype point_id, tableint internalId, float updateNeighborProbability);

I think as long as points are not modified after insertion into the external table, the synchronization logic from HNSW should work for this case as is. So maybe the original HNSW class could be extended to implement both features; a rough sketch of the modified accessor is below.
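
For illustration, the accessor under this id-indirection scheme might look as follows; pointtype and external_data_ are hypothetical names, assuming the external table is a contiguous (possibly mmap-backed) array of vectors:

// The slot that used to hold the raw vector now holds a point id, which is
// resolved against an external table kept outside the index.
inline const char *getDataByInternalId(tableint internal_id) const {
    pointtype point_id = *reinterpret_cast<const pointtype *>(
        data_level0_memory_ + internal_id * size_data_per_element_ + offsetData_);
    return external_data_ + (size_t)point_id * data_size_;  // one vector = data_size_ bytes
}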

jonathangomesselman commented 2 years ago

@dbespalov @yurymalkov @louisabraham

This is an incredibly interesting thread and very relevant to a use case I am looking into, where I hope to use an ANN algorithm on a machine with limited RAM. Did anything ever come of this thread in terms of support for memory-mapped files? Thanks in advance for more info!

I am also curious, @louisabraham: since creating this thread, have you worked with other ANN libraries that accomplish this with performance similar to HNSW (which is pretty sweet!)?

louisabraham commented 2 years ago

Did you have a look at Annoy?

jonathangomesselman commented 2 years ago

I have looked a bit, but will certainly look more! Were you using the Spotify Annoy library?

From looking at the ANN Benchmarks, I was hoping to find one of the stronger-performing libraries. Annoy seems relatively weak on recall vs. query speed and generates a very large on-disk index; then again, maybe those performance hits come from Annoy being optimized for memory mapping and production characteristics. Eager to dig more into this, thanks @louisabraham!