lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License
897 stars 105 forks source link

Does n_search_trees do anything? #139

Closed jlmelville closed 3 years ago

jlmelville commented 3 years ago

Looking at the rp trees code it seems that n_search_trees may no longer do anything: only the best-scoring tree is used in search.

I used the F-MNIST notebook to generate some extra data (warning: you will have to wait around ten minutes to use the sklearn kd-tree implementation to get at the ground truth if you run it this way):

index = pynndescent.NNDescent(fmnist_train, random_state=42, verbose=True, n_iters=1)
neighbors = index.query(fmnist_test, epsilon=0)
np.mean(accuracy_per_query_point(neighbors, tree_neighbors))

# eventually gives
0.8878799999999999

I have purposely made the nearest neighbor descent and search routines not perform well so they don't swamp the effect of the tree initialization. Repeating with n_search_trees=10:

index_st10 = pynndescent.NNDescent(fmnist_train, verbose=True, random_state=42, n_iters=1, 
n_search_trees=10)
neighbors_st10 = index_st10.query(fmnist_test, epsilon=0)
np.mean(accuracy_per_query_point(neighbors_st10, tree_neighbors))

# also gives
0.8878799999999999

There seems to be no effect whatsoever. The _search_forest field goes through various transformations at different points in the code so I had trouble fully confirming what's going on, but looking at _init_search_function: https://github.com/lmcinnes/pynndescent/blob/ee915d2599c1fd36e13ae4ac15ba91ecee503c86/pynndescent/pynndescent_.py#L1131-L1139

It does seem that only the first search tree data is used (_search_forest is not read from again in that method or the internally-defined functions).

lmcinnes commented 3 years ago

I think it may be dead code. I would have to go through and check in detail, but for a while using less trees in search seemed like a good idea, but I wasn't sure how many were needed, so I made it a parameter. I certainly found that ultimately the first tree is really all that is needed, and so some of the code may now assume that. I obviously didn't propagate that all the way through apparently. For search one tree is enough though.

jlmelville commented 3 years ago

Is the idea of scoring the forest against the NND-optimized neighbor graph to find the single best tree a McInnes innovation? That's a much better and simpler idea than what I was grasping at in my question in #110 about refining the trees from the NND result.

Anyway, for some practical actions from this, I could probably come up with a PR to remove n_search_trees at least from the public-facing code if that's helpful (although that would be breaking backwards compatibility, so might not be worth it -- could just document the parameter is deprecated in the docstring?)

lmcinnes commented 3 years ago

A PR to change the docstring would certainly be most welcome.

The idea of scoring the trees against the graph to pick the best one was mine, but it seems the most obvious thing to do once you have to pick a tree. It's also worth noting that I actually reshuffle the data to match the best tree as well, since that's a cheap approximation of a good order to keep the data for the graph as well. I did play around with spectral decompositions of the graph for ordering, but the gains were minimal (if there were any over the tree based ordering) and it was non-trivial so I just dropped it. Perhaps with a little more thought a better plan could be arrived at however, based on expected traversals.