Open louisabraham opened 3 years ago
Let me chime in on this, though it's Yura's package. It seems to me that copying data and even indexing it has a negligible cost nowadays compared to the effort spent on generating these vectors.
That's good news since it means the performance won't be affected too much. Note that my focus is really on decreasing RAM usage. HNSW is my go-to ANN package, but it lacks an option to index large datasets on a machine with limited RAM, as Annoy offers.
But Annoy would still be slow if the index doesn't fit into RAM?
It's acceptable thanks to mmap, especially on an SSD.
I'm just curious how hard it would be to do in HNSW and how that would compare in terms of speed.
Hi @louisabraham
Concerning what to store - for each read of the data there is a corresponding read of the link list during the search and, in addition, they can be collocated (in case mmap is done within the index), so probably putting them together would not slow down the index that much. Something similar (storage of both links and data in the lower level) was studied in [https://proceedings.neurips.cc//paper/2020/file/788d986905533aba051261497ecffcbb-Paper.pdf] (and overall IMO it makes sense to borrow insights from the paper).
@louisabraham @yurymalkov
I think both of these features would be extremely useful!!
I have recently added pickle support to the python bindings, and the internal data storage of the HierarchicalNSW (HNSW) class, implemented in hnswalg.h, is fresh in my head. Maybe I can offer some insights here as well.
Data points are stored in the data_level0_memory_ array, defined on line 120. The points are stored together with external labels and links for the first layer (level 0). They are copied into data_level0_memory_ by the HNSW methods addPoint and updatePoint on lines 1030 and 833. Access to points in data_level0_memory_ is encapsulated in the inline method getDataByInternalId, defined on line 151.
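To make the layout concrete, here is a hedged sketch of the interleaved per-element level-0 storage and the getDataByInternalId-style pointer arithmetic. The field sizes are illustrative (hnswlib derives them from M, the vector dimension, and labeltype); Level0Layout is a hypothetical name, not part of hnswlib.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Illustrative stand-in for hnswlib's label type.
using labeltype = size_t;

// Sketch of the interleaved level-0 layout: [links | vector data | label]
// repeated once per element inside data_level0_memory_.
struct Level0Layout {
    size_t size_links_level0;  // bytes reserved for level-0 links per element
    size_t data_size;          // bytes per vector (dim * sizeof(float))

    size_t size_per_element() const {
        return size_links_level0 + data_size + sizeof(labeltype);
    }

    // Mirrors the idea behind getDataByInternalId: jump to the element's
    // block, then skip past the links to reach the vector data.
    const char* getDataByInternalId(const char* data_level0_memory_,
                                    size_t internal_id) const {
        return data_level0_memory_ + internal_id * size_per_element()
               + size_links_level0;
    }
};
```

Because links and data sit side by side in each block, a search that touches an element's links tends to pull its vector into cache (or into a mapped page) in the same read.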
One option could be to store point ids in data_level0_memory_ instead of actual data points. These ids would then be used by getDataByInternalId to load points from some external table. In this case, the signatures for addPoint and updatePoint would have to change from
void addPoint(const void *data_point, labeltype label);
void updatePoint(const void *dataPoint, tableint internalId, float updateNeighborProbability);
to something like
void addPoint(pointtype point_id, labeltype label);
void updatePoint(pointtype point_id, tableint internalId, float updateNeighborProbability);
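A minimal sketch of that indirection, assuming pointtype is an integer id and the external table is flat row-major float storage (ExternalTable and IndexSketch are hypothetical names, not hnswlib API):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

using pointtype = uint64_t;

// Hypothetical external table: vectors live outside the index.
struct ExternalTable {
    const float* base;  // start of external, row-major float storage
    size_t dim;         // vector dimensionality
    const float* get(pointtype point_id) const {
        return base + point_id * dim;
    }
};

// Sketch of an index that stores ids where the vectors used to be.
struct IndexSketch {
    ExternalTable table;
    std::vector<pointtype> ids;  // what data_level0_memory_ would hold

    // addPoint(pointtype, ...) records the id instead of copying the vector.
    void addPoint(pointtype point_id) { ids.push_back(point_id); }

    // getDataByInternalId resolves through the external table.
    const float* getDataByInternalId(size_t internal_id) const {
        return table.get(ids[internal_id]);
    }
};
```

The rest of the search code can stay unchanged as long as getDataByInternalId still returns a pointer to the vector data.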
I think as long as points are not modified after insertion into the table, the synchronization logic from HNSW should work for this case as is. So maybe the original HNSW class can be extended to implement both features.
Loading the table from memory-mapped files should be fairly straightforward. The file path can be passed to the constructor (and no issues with serialization). Not sure about the choice between numpy's memmap and POSIX's mmap here though.
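For the POSIX route, a hedged sketch of what opening such a table could look like (MappedVectors is a hypothetical name; the row-major float32 file layout with a known dim is an assumption):

```cpp
#include <cassert>
#include <cstdio>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Read-only view over a flat file of float vectors, via POSIX mmap.
struct MappedVectors {
    const float* data = nullptr;
    size_t bytes = 0;

    bool open_file(const char* path, size_t n_bytes) {
        int fd = ::open(path, O_RDONLY);
        if (fd < 0) return false;
        void* p = ::mmap(nullptr, n_bytes, PROT_READ, MAP_SHARED, fd, 0);
        ::close(fd);  // the mapping stays valid after closing the fd
        if (p == MAP_FAILED) return false;
        data = static_cast<const float*>(p);
        bytes = n_bytes;
        return true;
    }

    // Row-major access: row i of a table with `dim` floats per vector.
    const float* row(size_t i, size_t dim) const { return data + i * dim; }

    ~MappedVectors() {
        if (data) ::munmap(const_cast<float*>(data), bytes);
    }
};
```

Pages are then faulted in on demand, so only the touched vectors occupy RAM, which is the behavior being asked for on low-memory machines.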
I don't see a clean way for the case when the table is defined in main memory and passed to HNSW as a pointer or a numpy array. I can check the pybind11 docs; maybe there is something that can be leveraged here.
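On the C++ side, the caller-owned case could reduce to a non-owning view over a pointer and shape; pybind11's buffer protocol (e.g. py::array_t's request()) can produce exactly such a pointer from a numpy array without copying. BufferView is a hypothetical name for illustration:

```cpp
#include <cassert>
#include <cstddef>

// Non-owning view over caller-owned, row-major float storage.
// The caller must keep the buffer alive for the lifetime of the index.
struct BufferView {
    const float* ptr;   // e.g. obtained from a numpy array's buffer
    size_t rows, dim;

    const float* row(size_t i) const { return ptr + i * dim; }
};
```

The hard part is lifetime, not access: the index would have to document (or enforce, e.g. by holding a reference to the Python object) that the array outlives it.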
@dbespalov @yurymalkov @louisabraham
This is an incredibly interesting thread and very relevant to a use case that I am looking into where I am hoping to use an ANN algorithm on a machine with limited RAM. I was wondering if anything ever came of this thread in terms of allowing for memory-mapped files? Thanks in advance for more info!
I am also curious @louisabraham if since creating this thread you have worked with other ANN libraries that accomplish this with similar performance to HNSW (which is pretty sweet!)?
Did you have a look at Annoy?
I have a bit, but will certainly look more! Were you using the Spotify annoy library?
From looking at the ANN Benchmarks, I was hoping to find one of the stronger-performing libraries; Annoy seems relatively low-performing in recall/query-speed trade-off and generates a very large on-disk index. However, maybe these performance hits come from Annoy being optimized for memory mapping and production characteristics. Eager to dig more into this, thanks @louisabraham!
I see a huge potential for improvement at a very low cost. Currently, indexes are very optimized in size (and one can control it with parameters), but we still have to copy all the data.
My proposal has two steps:
I can help to implement this but I first need to know: