Code refactoring - Githubissues

gopalrs commented 1 year ago

A lot of the code in DiskANN is currently monolithic and written in imperative style. Some thoughts on making DiskANN more extensible:

Make a correct Index class hierarchy. As suggested in a different issue, the BasicIndex class should simply have the graph data structure, insertion and search.
Incremental and filter indexes can derive from BasicIndex and add their own functionality.
Ideally, PQFlashIndex should also derive from BasicIndex, but we could make it derive from an interface which BasicIndex implements. Then, BasicIndex can be a part of the PQFlashIndex class but is only created during index build and not during search.
PQ data can be a member of the PQFlashIndex, with PQ methods moved into a separate class. We could have an interface for PQ/OPQ, and clients should be able to plug in one or the other.
Customizations we make for different distance functions (like normalization for cosine and inner product) can also be strategies that are invoked during index build/search.
Ideally, there should be a Template Method for search that the Index class defines, along with helper overrides in the derived classes that retrieve the next element from the graph, and so on. But as a first step, we can have separate search functions.
With all these changes, we should be able to eliminate the aux_utils.cpp file and move its functionality into specific classes.
Save and load can also be moved to helper classes. This will give clients the flexibility to decide where and how to store the DiskANN graph.
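
To make the hierarchy, strategy, and Template Method points above a bit more concrete, here is a minimal C++ sketch. Every name in it (DistanceStrategy, L2Distance, CosineDistance, FilteredIndex, accept, insert_with_label, set_filter) is hypothetical and not taken from the DiskANN code base, and BasicIndex here is only a stand-in for the class proposed above. A brute-force scan takes the place of the graph traversal so the example stays self-contained; it is meant to show the shape of the design, not an implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

// Strategy: distance-specific behaviour (such as normalization) lives here,
// so the index itself does not branch on the metric. Hypothetical interface.
struct DistanceStrategy {
    virtual ~DistanceStrategy() = default;
    virtual Vec preprocess(const Vec& v) const { return v; }  // default: no-op
    virtual float distance(const Vec& a, const Vec& b) const = 0;
};

struct L2Distance : DistanceStrategy {
    float distance(const Vec& a, const Vec& b) const override {
        assert(a.size() == b.size());
        float sum = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) {
            const float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;  // squared L2 is enough for ranking
    }
};

struct CosineDistance : DistanceStrategy {
    // Normalize once at insert/query time; zero vectors are left unchanged.
    Vec preprocess(const Vec& v) const override {
        float norm = 0.0f;
        for (float x : v) norm += x * x;
        norm = std::sqrt(norm);
        if (norm == 0.0f) return v;
        Vec out(v);
        for (float& x : out) x /= norm;
        return out;
    }
    // Assumes both inputs were already passed through preprocess().
    float distance(const Vec& a, const Vec& b) const override {
        assert(a.size() == b.size());
        float dot = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) dot += a[i] * b[i];
        return 1.0f - dot;
    }
};

// Base class: owns the data, insertion, and the search skeleton.
class BasicIndex {
public:
    explicit BasicIndex(std::unique_ptr<DistanceStrategy> dist)
        : dist_(std::move(dist)) {}
    virtual ~BasicIndex() = default;

    std::size_t insert(const Vec& v) {
        points_.push_back(dist_->preprocess(v));
        return points_.size() - 1;
    }

    // Template method: the loop is fixed, accept() is the hook that derived
    // classes override. Returns the id of the nearest accepted point, or -1.
    int search(const Vec& query) const {
        const Vec q = dist_->preprocess(query);
        int best = -1;
        float best_dist = 0.0f;
        for (std::size_t id = 0; id < points_.size(); ++id) {
            if (!accept(id)) continue;
            const float d = dist_->distance(q, points_[id]);
            if (best < 0 || d < best_dist) {
                best = static_cast<int>(id);
                best_dist = d;
            }
        }
        return best;
    }

protected:
    virtual bool accept(std::size_t) const { return true; }

private:
    std::unique_ptr<DistanceStrategy> dist_;
    std::vector<Vec> points_;
};

// Derived class: adds labels and overrides only the hook.
class FilteredIndex : public BasicIndex {
public:
    using BasicIndex::BasicIndex;

    std::size_t insert_with_label(const Vec& v, int label) {
        const std::size_t id = insert(v);
        labels_.resize(id + 1, -1);  // points added via insert() get label -1
        labels_[id] = label;
        return id;
    }
    void set_filter(int label) { filter_ = label; }

protected:
    bool accept(std::size_t id) const override {
        return id < labels_.size() && labels_[id] == filter_;
    }

private:
    std::vector<int> labels_;
    int filter_ = 0;
};
```

A client would build an index with something like `FilteredIndex idx(std::make_unique<L2Distance>());`, so swapping the metric means passing a different strategy object and leaves the index class untouched. The same pattern could presumably extend to a PQ/OPQ interface and to save/load helpers, but that is left out here.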

Will expand on this with more details.

harsha-simhadri commented 1 year ago

@gopalrs Did you get a chance to look at the PDF Magdalen shared here: https://github.com/microsoft/DiskANN/issues/166

Is it possible to merge this discussion with that one?

gopalrs commented 1 year ago

Will do, I forgot about that.
