saadsharif / ttds-group

TTDS Group Project
3 stars 0 forks source link

Selection Criteria for Algorithm #1

Open gingerwizard opened 2 years ago

gingerwizard commented 2 years ago

This attempts to consider some of the criteria we should consider when selecting algorithm:

  1. Are the results approximate or exact? If approximate, how accurate vs exhaustive search? is it tuneable? If we want to tune do we need to reindex?
  2. Do we need to perform special operations on the vectors e.g. compression, for the algorithm to work?
  3. Does the algorithm support high dimensionality and high document counts? Could we use sparse vectors or would we need to encode to lower dimensionality?
  4. Does the algorithm deliver low latency queries and high throughput on large datasets? How does dimensionality impact this? Can it be tuned?
  5. How does the algorithm perform for indexing performance - both throughput and final size? How does dimensionality impact this?
  6. Can datapoints be added, updated and removed? Or does the corpus need to be indexed in one go?
  7. Can queries be executed concurrently? Can querying occur whilst update operations are occurring? Do we need to support this?
  8. Memory requirements - do the indexed vectors need to fit entirely in RAM or are disk space structures e.g. with memory mapping, supported?
  9. Can indexes be easily serialized to disk to save between restarts?
  10. How does the algorithm deal with varying vector dimensions e.g. query might N dimensions and text documents various..or do all vectors have to be the same size? If so, how do we handle?
  11. Do the algorithms rely on special hardware e.g. GPUs or is commodity hardware fine (I think we prefer)
  12. Does the algorithm use a custom distance metric or just euclidean distance?
  13. General suitability for text retrieval.
  14. Any good example implementations?

We should in turn state our requirements, based on the above, and filter to those algorithms which satisfy them.

gingerwizard commented 2 years ago

One way of looking at selection - it doesn't consider all algorithms. Taken from https://towardsdatascience.com/comprehensive-guide-to-approximate-nearest-neighbors-algorithms-8b94f057d6b6

image

gingerwizard commented 2 years ago

Useful overviews -

https://arxiv.org/abs/1610.02455 - sections 2-6 cover most algorithm options

https://arxiv.org/abs/1612.07545 - focus on hashing options

gingerwizard commented 2 years ago

Interesting analysis https://medium.com/vespa/approximate-nearest-neighbor-search-in-vespa-part-1-36be5f43b8d1. Vespa is probably the leading vector search product out there.

gingerwizard commented 2 years ago

This is super useful! https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index

enzo-inc commented 2 years ago

ANN algorithms benchmarking: http://ann-benchmarks.com/