lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

HNSW support #213

Open andrewluetgers opened 5 years ago

andrewluetgers commented 5 years ago

Just curious whether the HNSW algorithm is amenable to use in this project. Probably not directly, since I see you are JIT-compiling the distance metric in, but is it worth a look at that algorithm for the NN search versus random projections? It is consistently the top-performing algorithm on the ANN benchmarks, and I was wondering whether the performance of UMAP is significantly impacted by the NN component; if so, would incorporating HNSW make a meaningful difference?

lmcinnes commented 5 years ago

Thanks for the suggestion @andrewluetgers. I have spent a little bit of time looking into this, and for now haven't quite reached a tipping point of implementing HNSW for UMAP.

At the end of the day, UMAP needs an approximate knn-graph. One way to get that is to use an approximate nearest neighbor algorithm (like NN-descent or HNSW) to find the k-nearest-neighbors of every point. Such algorithms have two steps: the first is index construction, where you build a data structure that will help you in the search; the second is querying, where you make use of the index to perform an approximate nearest neighbor search for a query point.

HNSW is far and away the fastest algorithm at the querying phase -- once you've constructed the index, the process of finding approximate nearest neighbors for query points is extremely efficient. It is the querying phase that ANN-benchmarks measures (as this is the primary task for ANN problems; index construction is usually a one-off). The catch is that index construction for HNSW is only in the "pretty fast" category. In contrast, the index that NN-descent builds is the knn-graph itself. So for NN-descent in UMAP we don't actually need the search phase: we just construct the index and we are done. That tends to give NN-descent an advantage that isn't apparent from ANN-benchmarks alone.
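To make the distinction concrete, here is a minimal sketch of the structure UMAP ultimately consumes: the indices and distances of each point's k nearest neighbors. This uses plain NumPy brute force rather than NN-descent or HNSW, and the `knn_graph` helper name is hypothetical; it just illustrates what "the index NN-descent builds is the knn-graph" means.

```python
import numpy as np

def knn_graph(data, k):
    """Return (indices, distances) of the k nearest neighbors of each
    point, excluding the point itself -- the knn-graph UMAP needs.
    Exact brute force for illustration; NN-descent/HNSW approximate this."""
    # Pairwise squared Euclidean distances via (a - b)^2 = a^2 + b^2 - 2ab.
    sq = np.sum(data ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * data @ data.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    # k smallest distances per row (unordered), then sort those k.
    idx = np.argpartition(d2, k, axis=1)[:, :k]
    rows = np.arange(len(data))[:, None]
    dist = np.sqrt(np.maximum(d2[rows, idx], 0.0))  # clamp tiny negatives
    order = np.argsort(dist, axis=1)
    return idx[rows, order], dist[rows, order]

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))
indices, distances = knn_graph(X, k=5)  # shapes (100, 5) and (100, 5)
```

With NN-descent, constructing the index yields exactly this pair of arrays, so the query phase that ANN-benchmarks times never has to run.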

At the same time, I like HNSW, and as you note, it is the best performing algorithm for querying. It is possible that full HNSW index construction plus querying might still be faster than NN-descent index construction alone. It would require a numba-jitted implementation though, and that would take a while to write ... so for now I haven't quite gotten to it.

jlmelville commented 5 years ago

FWIW, I have looked at replacing Annoy with HNSW in uwot, but I wasn't able to find settings where the increase in query speed offset the increase in index-building time.

andrewluetgers commented 5 years ago

@lmcinnes wow, thanks for the extremely informative response. I had never considered the trade-off between index-construction time and query time. Sounds like there is little to gain for a lot of work.