saadsharif / ttds-group

TTDS Group Project
3 stars 0 forks source link

Algorithm Candidates #2

Open gingerwizard opened 2 years ago

gingerwizard commented 2 years ago

We can split ANN algorithms into three distinct categories; trees, hashes, and graphs.

The following represent possible algorithmic approaches. For each approach there are typically variants. Note these focus on techniques that support low dimensionality (100ish max) I believe and this require encoding techniques for the text.

  1. Tree-based space partition methods e.g. Annoy, Annoy2 (latter supports updates I believe) and FLANN.
  2. Product Quanitization techniques e.g. using k-means. A few approaches here, some use IndexPQ others encode offsets in a technique called "Product Quantization With Inverted File Index" - IVFPQ https://www.youtube.com/watch?v=BMYBwbkbVec and https://www.youtube.com/watch?v=AQau4-VF64w maybe useful. Also https://www.pinecone.io/learn/product-quantization. Optimized Product Quantization (OPO) seems to be the current leader.
  3. Locality Sensitive Hashing methods - https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134 numerous algorithms here inc. e.g. Fly, SRS and QALSH. Maybe do quick research on the latest and pick one.
  4. Inverted File Index Intuition - need a specific algorithm here. Its a technique using Voronoi cells)
  5. Neighbourhood graph based methods - KGraph, Hierarchical Navigable Small World Graphs Dale @gingerwizard https://github.com/saadsharif/ttds-group/issues/3

Sparse vector techniques (I don't know if we want to go down this route). It would be simpler to implement:

  1. https://arxiv.org/abs/1011.2807
  2. Searching on sparse vectors - https://github.com/facebookresearch/pysparnn
  3. Local sensitive hashing techniques e.g. SRS and OPO might work. Others more unknown:

Diversified Proximity Graphs Navigating Spreading Out Graph (NSG) - https://arxiv.org/abs/1707.00143

gingerwizard commented 2 years ago

Useful https://towardsdatascience.com/using-approximate-nearest-neighbor-search-in-real-world-applications-a75c351445d - interesting in that post-filtering is not ideal e.g. by date or a field.

gingerwizard commented 2 years ago

https://arxiv.org/abs/1610.02455 - sections 2-6 cover most algorithm options

gingerwizard commented 2 years ago

I think inevitably we will mix algorithms - see https://www.pinecone.io/learn/composite-indexes/