lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License

Initialize queryable index with init_graph? #110

Open ivirshup opened 3 years ago

ivirshup commented 3 years ago

I would like to be able to more quickly construct an NNDescent object from a precomputed distance matrix, and potentially RP Forest. This is sorta possible right now with the init_graph argument, but the NNDescent object constructed cannot be queried since there's no RP Forest (#103).

Example of this failing:

```python
import pynndescent
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

train, test = train_test_split(make_blobs(10_000)[0])

from_scratch = pynndescent.NNDescent(train, n_neighbors=15)
indices, _ = from_scratch._neighbor_graph
from_scratch.query(test)  # works

from_init = pynndescent.NNDescent(train, n_neighbors=15, init_graph=indices)
from_init.query(test)  # fails
```

```pytb
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in
----> 1 from_init.query(test)

~/github/pynndescent/pynndescent/pynndescent_.py in query(self, query_data, k, epsilon)
   1585         """
   1586         if not hasattr(self, "_search_graph"):
-> 1587             self._init_search_graph()
   1588
   1589         if not self._is_sparse:

~/github/pynndescent/pynndescent/pynndescent_.py in _init_search_graph(self)
    953
    954         if not hasattr(self, "_search_forest"):
--> 955             tree_scores = [
    956                 score_linked_tree(tree, self._neighbor_graph[0])
    957                 for tree in self._rp_forest

TypeError: 'NoneType' object is not iterable
```

I think this would be fairly straightforward to make work. Here is a rough proof of concept:

Hacky example of this working (note: the original snippet queried `from_scratch` twice; the second query should be against `from_init`):

```python
import numpy as np

from_scratch = pynndescent.NNDescent(train, n_neighbors=15)
indices, _ = from_scratch._neighbor_graph
rp_forest = from_scratch._rp_forest
query_indices_scratch, query_distances_scratch = from_scratch.query(test)

from_init = pynndescent.NNDescent(train, n_neighbors=15, init_graph=indices)
from_init._rp_forest = rp_forest
query_indices_init, query_distances_init = from_init.query(test)

# I think they won't always be exactly the same at the moment,
# but I think this could be addressed
np.testing.assert_allclose(query_indices_scratch, query_indices_init)
np.testing.assert_allclose(query_distances_scratch, query_distances_init)
```

Use case

The use case is single-cell analysis, where we store the neighbor graph for our dataset (since we use it for multiple purposes) and want to be able to speed up querying the dataset using reconstructed graphs.

Construction benchmark, using fmnist data from ann-benchmarks:

```python
%time pynndescent.NNDescent(fmnist_train)
# CPU times: user 1min 3s, sys: 813 ms, total: 1min 4s
# Wall time: 5.92 s

# indices taken from the index built above
%time pynndescent.NNDescent(fmnist_train, init_graph=indices)
# CPU times: user 10.5 s, sys: 49.2 ms, total: 10.5 s
# Wall time: 1.4 s
```

Questions

I'm not sure how much validation should or could be done to make sure initialization values are valid. To some extent, I think it's also alright to say "user beware" here. Some thoughts on checks that could be made:

Could values like rp_forest and _search_function be created on demand and cached via a property? That might make state easier to manage, and I believe all necessary parameters are already being stored on the NNDescent object.

This is more of an extension, but does the init_graph have to have the correct value of k? For instance, it would ostensibly speed up index creation even if you could only provide an initial graph for a smaller value of k.
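To illustrate the smaller-k idea, here is a rough sketch (`pad_init_graph` is a hypothetical helper, not part of pynndescent): an init graph with fewer than `n_neighbors` columns could be padded with random candidate indices, matching how pure nearest neighbor descent starts from random edges anyway.

```python
import numpy as np

def pad_init_graph(init_graph, n_neighbors, n_samples, seed=None):
    """Pad a (n_samples, k) neighbor index array with k < n_neighbors out
    to n_neighbors columns using random candidate indices, so nearest
    neighbor descent can refine the extra slots just as it would refine
    a fully random initialization."""
    rng = np.random.default_rng(seed)
    init_graph = np.asarray(init_graph)
    k = init_graph.shape[1]
    if k >= n_neighbors:
        # Already wide enough; truncate to the requested width.
        return init_graph[:, :n_neighbors]
    extra = rng.integers(0, n_samples, size=(init_graph.shape[0], n_neighbors - k))
    return np.hstack([init_graph, extra])
```

The padded array could then be passed as `init_graph`; NND should simply converge the random columns as it normally would.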

(ping @Koncopd)

lmcinnes commented 3 years ago

I certainly agree that the use case makes a lot of sense. A question is how best to actually handle this. Queries will be slower without a forest / search tree as the initialization for the graph search will be slower. In general one doesn't just have trees lying around to pass in as you did in your example, so one would need to generate them. Potentially we could just make a single search tree if _rp_forest is None which may not be the best tree but will do the job. I'll have a look through the code and see if this is something that can be done easily (if so I'll just do it), or if it would require a little more refactoring (in which case it may take a while to get to).

lmcinnes commented 3 years ago

Short answer: basic support is easy, so I added it. In principle the current master branch should work for you now.

That leaves the harder questions about validation. I think "user beware" is reasonable to a degree here. Actually I'm surprised things worked as well as they did for you -- the whole init_graph was actually just a hack I dropped in to let me explore a few research questions, and I didn't bother to pull it out again. I wasn't really expecting other people to use it that much.

On the other hand you do present some pretty reasonable use-cases, so it is not unlikely that others will have similar uses. That means making things a little more user friendly is probably a good idea.

Some answers to specific questions:

ivirshup commented 3 years ago

Thanks for the quick addition!

In general one doesn't just have trees lying around to pass in as you did in your example, so one would need to generate them.

For sure. My initial test of this was just to change this:

https://github.com/lmcinnes/pynndescent/blob/a0ca8400b681d73e48e4fc2ffe2e905ee89be8dd/pynndescent/pynndescent_.py#L715-L718

to

        if tree_init or n_trees != 0:
            self.tree_init = True
        else:
            self.tree_init = False

so that a tree could be constructed even when an init_graph was provided.

I guess I see there being multiple cases for passing an initial graph. I think I'd actually view this as providing multiple ways to construct an NNDescent object.

  1. **From data.** Pass in a data matrix and construct the graph. The current initializer workflow.
  2. **From a fully initialized graph.** Pass in the data matrix along with a completely correct graph. This shouldn't require any optimization. This is the case with the most need for validation (at least w.r.t. parameters; validating distances may not be worth it). Here, parameters serve for verification and can otherwise be derived from the input (e.g. k).
  3. **From a partially initialized graph.** Pass in a partially initialized graph. This could have the wrong number of neighbors, or even wrong distances. It is just used for initialization (can this be combined with rp_forest initialization reasonably?). Here, parameters override properties of the input.

Case 2 can mostly be achieved via case 3, though 2 would be nice for very fast instantiation. However, this probably depends on how quirky the distances get. Is this just the case of "alternative" metrics which need to go through a reverse transform (e.g. what you mention here), or is there more to it?
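For case 2, the validation in question might amount to a few cheap structural checks along these lines (`validate_init_graph` is a hypothetical helper; exactly which checks are worth doing is what's up for discussion):

```python
import numpy as np

def validate_init_graph(init_graph, n_samples, n_neighbors=None):
    """Cheap structural checks on a (n_samples, n_neighbors) index array."""
    g = np.asarray(init_graph)
    if g.ndim != 2 or g.shape[0] != n_samples:
        raise ValueError(f"expected {n_samples} rows, got shape {g.shape}")
    if n_neighbors is not None and g.shape[1] != n_neighbors:
        raise ValueError(f"expected {n_neighbors} columns, got {g.shape[1]}")
    if not np.issubdtype(g.dtype, np.integer):
        raise TypeError("init_graph must contain integer indices")
    if g.min() < 0 or g.max() >= n_samples:
        raise ValueError("neighbor indices out of range")
    return g
```

None of these touch the distances, so they stay O(n_samples × k) and cheap relative to index construction.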

lmcinnes commented 3 years ago

Yes, it is about the alternative metrics. The catch being that if and when any get added it is good to have as few places to make changes as possible (and ideally obvious ones). Potentially I can just add an extra parameter init_distances that defaults to None. If provided they will be used regardless of what the metric is defined to be, and no distance recomputation will take place. Of course any querying will still use the defined metric, but the distances themselves are only really needed for graph construction / refining so that wouldn't matter so much. It would also have a degree of "user beware". Although, for example, even euclidean distance uses squared euclidean instead, and the edge pruning will get messed up if you pass in a graph of euclidean distances. I worry that it will cause more harm than good as it will be easy to misuse.

ivirshup commented 3 years ago

Although, for example, even euclidean distance uses squared euclidean instead, and the edge pruning will get messed up if you pass in a graph of euclidean distances.

Could this be solved by having a "reverse_correction" as well as "correction" for the alternative distances?
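To make the suggestion concrete, here is a sketch of the idea only (the registry below is hypothetical, not pynndescent's actual alternative-distance machinery): where the internal metric is squared euclidean, the existing correction maps internal distances back to true ones, and a reverse correction would map user-supplied true distances into the internal form before refinement and pruning.

```python
import numpy as np

# Hypothetical registry: each alternative metric pairs the existing
# "correction" (internal -> true distance) with a "reverse_correction"
# (true -> internal distance) for user-supplied init distances.
alternative_distances = {
    "euclidean": {
        "correction": np.sqrt,            # squared euclidean -> euclidean
        "reverse_correction": np.square,  # euclidean -> squared euclidean
    },
}

def to_internal_distances(distances, metric="euclidean"):
    """Map user-supplied true distances into the internal representation."""
    return alternative_distances[metric]["reverse_correction"](np.asarray(distances))
```

Each new alternative metric would then have to ship both directions, which is precisely the maintenance concern raised below.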


Also, I've been looking at #79 again, and had a question related to both of these.

How accurate are the nearest neighbors found by NNDescent? While this must vary with the number of neighbors and iterations, I was wondering if there is some benchmarking that has been done for the accuracy of graph construction (as opposed to querying, like the ANN benchmarks).

My concern here was: if you're providing poor guesses for initial indices, how bad can the results get?

lmcinnes commented 3 years ago

Could this be solved by having a "reverse_correction" as well as "correction" for the alternative distances?

Yes, I'm more concerned about the long term tail on maintaining such things for new distances, and not having something fall out of sync somewhere and creating hard to diagnose errors (I had similar concerns over adding the alternative distances initially, but there were some very significant performance wins for many standard use cases). Potentially this is a very niche use case, so it's a matter of balancing the amount of extra utility with long term overhead.

How accurate are the nearest neighbors found by NNDescent?

That's hard to answer because so much is dataset dependent. It can be very bad for certain metrics and data distributions, but largely it is pretty good. Anecdotally I generally expect 90%+ accuracy. I'm not aware of standard benchmarks for it however. Typically the way to check is to get the true nearest neighbors for a (small) subsample of points and get an estimate of the accuracy that way.
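The subsample check described above can be sketched like so (`estimate_recall` is an illustrative helper using a brute-force squared-euclidean scan, not pynndescent's own machinery):

```python
import numpy as np

def estimate_recall(data, approx_indices, n_check=200, seed=0):
    """Estimate kNN recall: brute-force the exact neighbors for a small
    random subsample of points and count overlaps with the approximate
    neighbor lists."""
    rng = np.random.default_rng(seed)
    n, k = approx_indices.shape
    sample = rng.choice(n, size=min(n_check, n), replace=False)
    hits = 0
    for i in sample:
        sq_dists = np.sum((data - data[i]) ** 2, axis=1)
        true_nn = np.argsort(sq_dists)[:k]  # each point is its own nearest neighbor
        hits += len(np.intersect1d(approx_indices[i], true_nn))
    return hits / (len(sample) * k)
```

A few hundred sampled points is usually enough to get a stable estimate, at a tiny fraction of the cost of an exact all-pairs computation.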

My concern here was: if you're providing poor guesses for initial indices, how bad can the results get?

Pure NNDescent assumes you start with an initial graph of random edges and does well from there, so you can provide quite poor guesses and still do well; it will just take longer to converge. The greater risk is in providing an initial graph with suitably pathological properties (e.g. distinct connected components that don't relate to the neighbor structure). You are unlikely to get that with a generic noisy guess at nearest neighbors, but it is certainly possible to generate such graphs (take, say, the leaves of a single rp-tree for example). Again, this is subtle in the ways it can go astray, so a degree of user expertise is required.

jlmelville commented 3 years ago

@lmcinnes this is tangentially related to the current discussion here about accuracy due to nearest neighbor descent, so can I just confirm that the current way pynndescent works is:

To build the index:

  1. Build a forest.
  2. Refine it with nearest neighbor descent.

To query the index:

  1. Initialize guess with the forest from part 1 above.
  2. Refine it with a nearest neighbor descent-like approach (breadth-first search using neighbor of neighbors).

Specifically, it seems like the _neighbor_graph that results from the nearest neighbor descent used to build the index doesn't play any role in querying. That is: the refinement doesn't feed back into the forest in any way; the initialization of the query result uses the forest initialization (augmented with random points if necessary, but never the _neighbor_graph results, which may not exist any more if compress=True); and the breadth-first search uses neighbors of neighbors taken from the _search_graph, which is derived from the _search_forest. Is that correct?

The context for this question is that I continue to sporadically prod at creating a nearest neighbor descent package for R, and my implementation does not use any forest initialization. For very high-dimensional data, results are poor (#18 is still relevant here), in line with the accuracy determined by just using the _neighbor_graph from the index-building step. A "pure" NND approach gives about 45% accuracy via the _neighbor_graph without tree initialization for the dataset I have been using (this is in agreement with my R version which is reassuring). With the forest initialization pynndescent gets a much more respectable 85% accuracy after querying.

I want to make sure I am not missing anything clever being done in the query routine that would work with a naive "use the NND results as initialization" approach. Unfortunately, it doesn't seem easy to replace the forest-initialized indptr and indices data with that from _neighbor_graph, so I haven't been able to check myself with pynndescent.

jamestwebber commented 3 years ago

Rather than open an issue I guess I'll ask here: what format should init_graph take? It seems to be missing from the docstrings, so it's not obvious how to pass in a graph, or what will happen if I do. From looking at the code and reading this discussion I gather that I can initialize the kNN using my own graph rather than the rp tree approach.

Really, I can figure out the answer to this myself, but it would be nice if there were some hint in the documentation! 🙂

lmcinnes commented 3 years ago

@jamestwebber : The docs are definitely lacking on that front. In reality I hacked something in for my own needs for some experiments, and never got around to making a user friendly version. The short answer is that, right now, it takes a 2D array of shape (n_samples, n_neighbors) such that the entry at (i, j) gives the index of the jth neighbour of the ith data sample. I would happily accept a PR for documentation on that, or perhaps even a user friendlier version (accepting, say, sparse adjacency matrices and networkX graphs as well maybe?), if you cared to try.
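By way of illustration, converting a CSR-style sparse kNN adjacency matrix into that array format might look like the following sketch (`csr_to_init_graph` is a hypothetical helper, and assumes every row stores the same number of neighbors):

```python
import numpy as np

def csr_to_init_graph(indptr, indices):
    """Turn CSR-style (indptr, indices) arrays of a kNN adjacency matrix
    with a uniform row degree into the (n_samples, n_neighbors) index
    array that init_graph expects."""
    degrees = np.diff(indptr)
    if degrees.size == 0 or not np.all(degrees == degrees[0]):
        raise ValueError("every row must store the same number of neighbors")
    k = int(degrees[0])
    return np.asarray(indices).reshape(len(degrees), k)
```

A friendlier `init_graph` could accept this kind of input directly and do the conversion internally.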

@jlmelville : Wow, sorry, this totally slipped through the cracks on me! I apologise for not getting back to you. I think the major thing is that the _search_graph is constructed from the neighbor_graph_; it is really just a pruned version of a symmetrized (i.e. undirected) version of the neighbor_graph_. On the other hand the tree to initialize the search is pretty useful because it ensures that we start from a "decent" place -- perhaps not ideal, but a lot less likely to run into local minima far from the query point. In my experience it mostly helps with speed of convergence though, and not accuracy.
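The symmetrization described here can be sketched roughly as follows (the pruning step that pynndescent also performs is omitted):

```python
import numpy as np

def symmetrize_neighbor_graph(neighbor_indices):
    """Build undirected adjacency lists from a (n_samples, k) neighbor
    index array: i and j become linked whenever either i -> j or j -> i
    appears in the directed kNN graph. The _search_graph is essentially
    this, followed by edge pruning (not shown)."""
    n = neighbor_indices.shape[0]
    adjacency = [set() for _ in range(n)]
    for i, row in enumerate(neighbor_indices):
        for j in row:
            j = int(j)
            if j != i:  # skip self-edges
                adjacency[i].add(j)
                adjacency[j].add(i)
    return [sorted(nbrs) for nbrs in adjacency]
```

Symmetrizing matters for search: a query can then reach a point through any of its in- or out-edges, not just the directed k nearest.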

jamestwebber commented 3 years ago

The short answer is that, right now, it takes a 2D array of shape (n_samples, n_neighbors) such that the entry at (i, j) gives the index of the jth neighbour of the ith data sample.

Ah, so exactly what the output looks like! That's handy.

I would happily accept a PR for documentation on that, or perhaps even a user friendlier version (accepting, say, sparse adjacency matrices and networkX graphs as well maybe?), if you cared to try.

As you have no doubt noticed by now, I have a problem of saying I'll contribute PRs and then not getting to them, so I will demur this time. 😂