lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors

access distances in heap #211

Open bobermayer opened 1 year ago

bobermayer commented 1 year ago

Hi,

I'm trying to use a sparse distance matrix for agglomerative hierarchical clustering (using the algorithms from gbbs). While this works reasonably well for my use case with only nearest neighbors as input, it would be good to also have (some) distances between points that are further apart. I could add these later from a full distance matrix of a random subset, of course, but I wonder if it's possible to somehow get all the distances that were calculated during the search? They don't seem to be accessible from the rp_forest, but I have only a superficial understanding of the code.

Help would be very much appreciated!
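For reference, a minimal sketch of the setup described above: build a knn graph with pynndescent and pack it into a scipy sparse distance matrix. The NNDescent constructor and its neighbor_graph attribute are the library's public API; the toy data and variable names are illustrative.

```python
import numpy as np
from scipy.sparse import coo_matrix
from pynndescent import NNDescent

# toy data standing in for the real dataset
data = np.random.random((1000, 32)).astype(np.float32)

# build the approximate knn index; neighbor_graph holds
# (indices, distances), each of shape (n_points, n_neighbors)
index = NNDescent(data, n_neighbors=15)
knn_indices, knn_dists = index.neighbor_graph

# pack the knn result into a sparse distance matrix suitable as
# input for sparse agglomerative clustering
n_points = data.shape[0]
rows = np.repeat(np.arange(n_points), knn_indices.shape[1])
sparse_dists = coo_matrix(
    (knn_dists.ravel(), (rows, knn_indices.ravel())),
    shape=(n_points, n_points),
).tocsr()
```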

lmcinnes commented 1 year ago

The bad news is that distances calculated during the search are thrown away to save memory, so they just aren't accessible the way you would want.

The possibly good news is that there are some hidden features that may do what you want. There is a module graph_utils, importable as pynndescent.graph_utils, that adds functionality to "complete" a knn graph into a single connected component (specifically pynndescent.graph_utils.connect_graph). If that isn't quite what you need for your clustering purposes, the function find_component_connection_edge, which finds the shortest edge between two components in the graph (well, approximates it with an approximate search), could be modified to return a set of edges, and that might fit your needs. Are any of these useful, or do you need a more general random sampling of distances?
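A hedged sketch of that graph_utils route, rebuilding the knn graph as in the earlier sketch. The exact connect_graph signature is an assumption here and may differ between versions; check pynndescent/graph_utils.py.

```python
import numpy as np
from scipy.sparse import coo_matrix
from pynndescent import NNDescent
import pynndescent.graph_utils as graph_utils

data = np.random.random((1000, 32)).astype(np.float32)
index = NNDescent(data, n_neighbors=15)
knn_indices, knn_dists = index.neighbor_graph

n = data.shape[0]
rows = np.repeat(np.arange(n), knn_indices.shape[1])
knn_graph = coo_matrix(
    (knn_dists.ravel(), (rows, knn_indices.ravel())), shape=(n, n)
).tocsr()

# assumed signature: connect_graph(graph, index) returns the knn graph
# augmented with (approximately) shortest edges joining any previously
# disconnected components -- the added edges carry real computed
# distances, so the result can replace the original knn graph as
# clustering input; verify the exact arguments for your version
connected_graph = graph_utils.connect_graph(knn_graph, index)
```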

bobermayer commented 1 year ago

Thanks for the reply. Looking a bit more into the code, I suspected as much. I have to check whether my graphs are fully connected to see if the functionality you mention would help. Other than that, I was wondering if I could maybe increase the heap size, but I guess it's always assumed to be exactly n_neighbors and would anyway only keep the smallest distances? Carrying a second, bigger (unsorted?) heap through the search would be a less efficient way of saving all those distances, and I'm not sure the overhead is worth it, because calculating them is not actually that expensive; so maybe that's what I'll try first.
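One low-effort variant of that idea, using only the documented constructor parameter: request a larger n_neighbors so the heap simply keeps more (still only the smallest) distances per point.

```python
import numpy as np
from pynndescent import NNDescent

data = np.random.random((1000, 32)).astype(np.float32)

# a bigger heap via the public API: keep the 60 smallest distances per
# point instead of the default 30 -- distances beyond the heap are
# still discarded, but far more of them are retained, at extra build cost
index_wide = NNDescent(data, n_neighbors=60)
wide_indices, wide_dists = index_wide.neighbor_graph
```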

bobermayer commented 1 year ago

OK, so I created a fork where the distances discarded in the heap_push functions are saved to backup arrays instead. This seems to work, but I haven't done any detailed testing yet. I could try to implement this in a more elegant way in case you're interested.
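A minimal sketch of that idea in plain Python (not the fork's actual code): a fixed-size max-root heap of the current nearest neighbors, where every candidate that would normally be discarded is appended to a backup list. pynndescent's real heap_push lives in pynndescent.utils, is numba-compiled, and also deduplicates indices.

```python
import numpy as np

def heap_push_with_backup(priorities, indices, dist, idx, backup):
    """Push (dist, idx) onto a fixed-size nearest-neighbor heap
    (max-heap on distance, root = current worst kept neighbor),
    appending anything that would normally be discarded to backup."""
    if dist >= priorities[0]:
        # no closer than the current worst: normally thrown away
        backup.append((idx, dist))
        return 0
    if indices[0] >= 0:  # skip the (-1, inf) initialization sentinel
        # the evicted worst neighbor is a distance pynndescent drops
        backup.append((indices[0], priorities[0]))
    priorities[0] = dist
    indices[0] = idx
    # sift the new root down to restore the max-heap property
    i, n = 0, priorities.shape[0]
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        if left >= n:
            break
        swap = left
        if right < n and priorities[right] > priorities[left]:
            swap = right
        if priorities[swap] <= priorities[i]:
            break
        priorities[i], priorities[swap] = priorities[swap], priorities[i]
        indices[i], indices[swap] = indices[swap], indices[i]
        i = swap
    return 1

# usage: a 5-neighbor heap plus a backup of everything discarded
prios = np.full(5, np.inf)
inds = np.full(5, -1, dtype=np.int64)
backup = []
for i, d in enumerate(np.random.random(20)):
    heap_push_with_backup(prios, inds, d, i, backup)
```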

However, I'm getting errors with numba when installing the package locally (using python setup.py install): I can run it once, but I get segfaults afterwards (more specifically, I can run it multiple times in the same Python session, but not in successive independent Python sessions). This also happens with your master branch, though. Stack traces suggest a problem in rp_trees.make_dense_tree (but unrelated to #209, as it happens with both euclidean and correlation distance metrics), and it might be caused by indices = np.arange(data.shape[0]).astype(np.int32) in line 830:

No implementation of function Function(<built-in function arange>) found for signature:
  >>> arange(int64)

Thanks for any input. I have Python 3.8.0, numpy 1.23.5, numba 0.53.1 (or 0.56.1 on another system, with a similar problem).

Update: turning off numba caching for the make_tree functions in rp_trees.py appears to resolve the issue.

Another update: maybe this is related to caching problems with recursive functions?
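For reference, the workaround described above amounts to flipping numba's on-disk caching flag on the affected decorators. A schematic, self-contained illustration (not rp_trees.py's exact decorator arguments) of a recursive njit'd function with caching disabled:

```python
import numba

# rp_trees.py decorates the make_*_tree functions with something like
# @numba.njit(..., cache=True); the workaround is cache=False, which
# forces recompilation each session instead of loading a cached binary
# (the update above speculates the cache interacts badly with recursion)
@numba.njit(cache=False)
def recursive_example(n):
    if n <= 1:
        return 1
    return n * recursive_example(n - 1)

print(recursive_example(10))  # 3628800
```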