daxpryce closed this 1 month ago
(note: I'll want to rebase on main after the mostly-fixed-the-build PR is merged)
Sorry for the delay - GH notifications go to my Gmail account, and I don't check it often enough.
Should we update the dependencies? (Do we even have a requirements.txt?)
Dependencies are in pyproject.toml - which is where you want dependencies defined for a library you're publishing. A requirements.txt is more for applications that don't get published, and thus won't be imported.
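For illustration, a minimal sketch of where the two kinds of dependencies live in pyproject.toml (the pin shown is illustrative, not our actual constraint):

```toml
# pyproject.toml - dependency declarations for a published library
[project]
name = "diskannpy"
dependencies = [
    "numpy>=1.25,<1.26",  # runtime requirement; pin shown is an example only
]

[build-system]
requires = ["setuptools", "numpy"]  # build-time deps live here, not in [project]
build-backend = "setuptools.build_meta"
```

A requirements.txt, by contrast, typically pins an application's entire environment and is never consumed by consumers of the package.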
Anyway, the only strong requirement we have in our deps is numpy - which is actually a big challenge, as we use the numpy headers at C++ compile time, and we need to create compiled wheels for every version of numpy. Those, in turn, can't really be published to PyPI, as there's no structural way to keep "compiled for Python 3.10 and numpy 1.25" separate from "compiled for Python 3.11 and numpy 2.0.2". That's what conda is for, really. See https://github.com/microsoft/DiskANN/issues/544 for more details.
Continuing on this topic, but also slightly tangentially: I have absolutely seen that pyarrow and polars have native (read: something was compiled) support for numpy without pinning it to a fixed version. I have no idea how they've done it, but it may very well be possible, and we could be more loosey-goosey with our strict numpy requirement.
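One possibility (not verified for those projects specifically): numpy's C API has historically been forward-compatible, so a wheel compiled against the oldest numpy you intend to support can keep working with newer numpy at runtime. A rough sketch of the version comparison that idea relies on - the built-against version here is hypothetical, not what our wheels actually use:

```python
import numpy as np

# Hypothetical version the wheel's C++ was compiled against. Extensions
# built against an older numpy C API generally import fine on newer numpy;
# the reverse direction is what fails with "module compiled against API
# version ..." errors at import time.
BUILT_AGAINST = (1, 25)

# Major/minor of the numpy actually present at runtime.
running = tuple(int(x) for x in np.__version__.split(".")[:2])

# Forward-compatibility check: runtime numpy at least as new as build numpy.
compatible = running >= BUILT_AGAINST
```

If that property holds for our usage of the headers, one wheel per Python version (built against the floor numpy) could replace one wheel per (Python, numpy) pair.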
Which then dovetails into the actual answer to your question: we can absolutely upgrade the requirements, but they're basically all build-time requirements, not runtime - numpy is the only runtime requirement. Long answer shortened: "maybe, but not right now". We should look into it a bit deeper first to find a better way of handling the numpy dep than changing the strict numpy 1.25 requirement into a strict numpy 2.0 requirement, because that causes all sorts of problems for a bunch of reasons :)
@gopalrs
I'm getting occasional failures in one of the unit tests that seems to be around filtered memory indices. In short, my test compares an index's search results
I've added a comment to the unit test (with your name in it) explaining what I'm doing, but because this is (a) sporadic and (b) seemingly a result of some changes in the C++ itself (nothing in the Python has changed), I wanted to flag it as a case that the C++ tests maybe aren't catching or covering.
This commit adds some basic memmap support to the diskannpy.vectors_from_file utility function. You can now get back a memory-mapped, np.ndarray-conformant view (an np.memmap object) instead of requiring that the vectors be loaded fully into memory (np.ndarray style).
Any other comments?
Note that I set the default to mode="r", not "r+" (which is the numpy default). I would prefer we don't mess with mutating the vectors prior to handing them to diskann for build - I have no idea if that even works. I also wanted to leave the option open until we figure out whether it does.
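To illustrate the distinction the mode default is guarding: a minimal numpy-only sketch of a fully-loaded array vs. a read-only memory-mapped view. The flat float32 layout here is assumed purely for illustration, not diskann's actual on-disk format:

```python
import tempfile
import numpy as np

# Some vectors written to disk in a flat float32 layout (assumed for demo).
vectors = np.arange(12, dtype=np.float32).reshape(3, 4)
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    vectors.tofile(f)
    path = f.name

# Fully loaded: the entire file is read into RAM as an np.ndarray.
loaded = np.fromfile(path, dtype=np.float32).reshape(3, 4)

# Memory-mapped: pages are faulted in on access. mode="r" (rather than
# numpy's "r+" default) makes the view read-only, so callers can't
# accidentally mutate the vectors before they're handed to a build.
mapped = np.memmap(path, dtype=np.float32, mode="r", shape=(3, 4))

# An np.memmap is an np.ndarray subclass, so it conforms to the same API.
assert isinstance(mapped, np.ndarray)
assert np.array_equal(loaded, mapped)
```

With mode="r", any attempt to assign into `mapped` raises an error, which is exactly the safety the non-default choice buys.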