Closed jlmelville closed 3 years ago
I think it may be dead code. I would have to go through and check in detail, but for a while using less trees in search seemed like a good idea, but I wasn't sure how many were needed, so I made it a parameter. I certainly found that ultimately the first tree is really all that is needed, and so some of the code may now assume that. I obviously didn't propagate that all the way through apparently. For search one tree is enough though.
Is the idea of scoring the forest against the NND-optimized neighbor graph to find the single best tree a McInnes innovation? That's a much better and simpler idea than what I was grasping at in my question in #110 about refining the trees from the NND result.
Anyway, for some practical actions from this, I could probably come up with a PR to remove n_search_trees
at least from the public-facing code if that's helpful (although that would be breaking backwards compatibility, so might not be worth it -- could just document the parameter is deprecated in the docstring?)
A PR to change the docstring would certainly be most welcome.
The idea of scoring the trees against the graph to pick the best one was mine, but it seems the most obvious thing to do once you have to pick a tree. It's also worth noting that I actually reshuffle the data to match the best tree as well, since that's a cheap approximation of a good order to keep the data for the graph as well. I did play around with spectral decompositions of the graph for ordering, but the gains were minimal (if there were any over the tree based ordering) and it was non-trivial so I just dropped it. Perhaps with a little more thought a better plan could be arrived at however, based on expected traversals.
Looking at the rp trees code it seems that
n_search_trees
may no longer do anything: only the best-scoring tree is used in search.I used the F-MNIST notebook to generate some extra data (warning: you will have to wait around ten minutes to use the
sklearn
kd-tree implementation to get at the ground truth if you run it this way):I have purposely made the nearest neighbor descent and search routines not perform well so they don't swamp the effect of the tree initialization. Repeating with
n_search_trees=10
:There seems to be no effect whatsoever. The
_search_forest
field goes through various transformations at different points in the code so I had trouble fully confirming what's going on, but looking at_init_search_function
: https://github.com/lmcinnes/pynndescent/blob/ee915d2599c1fd36e13ae4ac15ba91ecee503c86/pynndescent/pynndescent_.py#L1131-L1139It does seem that only the first search tree data is used (
_search_forest
is not read from again in that method or the internally-defined functions).