microsoft / DiskANN

Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search

Memory map load vectors_from_file if requested #592

Closed daxpryce closed 1 month ago

daxpryce commented 1 month ago

This commit adds basic memmap support to the diskannpy.vectors_from_file utility function. It can now return a memory-mapped, np.ndarray-conformant view (an np.memmap object) instead of requiring the vectors to be loaded fully into memory as an np.ndarray.


Any other comments?

Note that I set the default to mode="r", not "r+" (the numpy default). I would prefer we don't mutate the vectors prior to handing them to diskann for build; I have no idea whether that would even work. I also wanted to leave the question open until we figure out if it does.
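For readers unfamiliar with the distinction, the two load paths can be sketched with plain NumPy against a DiskANN-style binary vector file (two little-endian int32 header fields, point count and dimension, followed by the flat float32 data). The function name and parameters below are illustrative, not the exact diskannpy implementation:

```python
import numpy as np

def load_vectors(path: str, use_memmap: bool = False, mode: str = "r"):
    """Load a DiskANN-style .bin vector file either fully into memory
    or as a memory-mapped view (illustrative sketch, not diskannpy)."""
    # The header is two int32 values: number of points and dimension.
    npts, dim = np.fromfile(path, dtype=np.int32, count=2)
    if use_memmap:
        # np.memmap returns an ndarray-conformant view backed by the file;
        # mode="r" keeps it read-only, unlike numpy's "r+" memmap default.
        return np.memmap(path, dtype=np.float32, mode=mode,
                         offset=8, shape=(npts, dim))
    # Full in-memory load: skip the 8-byte header, then reshape.
    data = np.fromfile(path, dtype=np.float32, offset=8)
    return data.reshape(npts, dim)
```

Both return values support the same slicing and arithmetic; the memmap variant just pages data in from disk lazily rather than reading the whole file up front.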

daxpryce commented 1 month ago

(note: I'll want to rebase on main after the mostly-fixed-the-build PR is merged)

gopalrs commented 1 month ago

Sorry for the delay. GH notifications go to my gmail account, and I don't look at it often enough.

daxpryce commented 1 month ago

> Should we update the dependencies? (Do we even have a requirements.txt?)

Dependencies are in pyproject.toml, which is where you want dependencies defined for a library you're publishing. requirements.txt is more for applications that don't get published and thus won't be imported.

Anyway, the only strong requirement we have in our deps is numpy, which is actually a big challenge: we use the numpy headers at C++ compile time, so we would need to create compiled wheels for every version of numpy. Those can't really be published to PyPI, as there's no structural way to keep a wheel "compiled for python 3.10 and numpy 1.25" separate from one "compiled for python 3.11 and numpy 2.0.2"; that's what conda is for, really. See https://github.com/microsoft/DiskANN/issues/544 for more details.


Continuing on this topic, but also slightly tangentially: I have absolutely seen that pyarrow and polars have native (read: something was compiled) support for numpy without pinning a fixed version. I have no idea how they've done it, but it may well be possible, and we could be more loosey-goosey with our strict numpy requirement.


Which then dovetails into the actual answer to your question: we can absolutely upgrade the requirements, but they're basically all build-time requirements, not runtime; numpy is the only runtime requirement. Long answer shortened: "maybe, but not right now". We should look into it a bit deeper first to find a better way of handling the numpy dep than changing the strict numpy 1.25 requirement into a strict numpy 2.0 requirement, because that's all sorts of crap for a bunch of reasons :)

daxpryce commented 1 month ago

@gopalrs

I'm getting occasional failures in one of the unit tests, which seems to be around filtered memory indices. In short, my test compares the results of an index search without filters against an index search with filters (where every element has the same label). The expectation is that the two result sets will be identical, but that does not always seem to be the case.

I've added a comment with your name to the unit test to showcase what I'm doing, but because this is (a) sporadic and (b) seems to be the result of changes in the C++ itself (nothing in the Python has changed), I wanted to flag it as a test case that the C++ tests maybe aren't catching or covering.
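The invariant the failing test checks can be illustrated with a brute-force reference completely outside DiskANN: when every point carries the same label, filtering on that label must not change the k-NN result. This sketch assumes Euclidean distance, and the helper name and signature are hypothetical, not the diskannpy API:

```python
import numpy as np

def knn(data: np.ndarray, query: np.ndarray, k: int,
        labels=None, want=None) -> np.ndarray:
    """Brute-force k-NN; optionally restrict candidates to points whose
    label equals `want` (hypothetical helper, not part of diskannpy)."""
    idx = np.arange(len(data))
    if labels is not None:
        # Filtered search: keep only the candidates matching the label.
        idx = idx[np.asarray(labels) == want]
    dists = np.linalg.norm(data[idx] - query, axis=1)
    # Stable sort so identical distances order deterministically.
    return idx[np.argsort(dists, kind="stable")[:k]]
```

With labels = ["a"] * len(data), knn(data, q, k, labels, "a") must equal knn(data, q, k); an ANN index satisfying the same contract should behave identically, which is what makes the sporadic mismatch look like a C++-side bug rather than a flaw in the test.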