
Algorithm Analysis - Hierarchical Navigable Small World Graphs #3


gingerwizard commented 2 years ago

Summary

Hierarchical Navigable Small World Graphs (HNSW) is proposed. This is a graph-based algorithm, similar to KGraph and Small World (SW). Most literature, e.g. https://arxiv.org/pdf/1610.02455.pdf (section 8.4.4), suggests these are inferior to HNSW, which builds on SW principles. We may still want to look at KGraph and RCT however - the former is intensive to build and can produce more variable results.

Resources

Great overview - https://www.youtube.com/watch?v=QvKMwLjdK-s

Overview

Basically, we construct a graph where each vertex is a vector, and then split it over layers. Vertexes with more neighbours go in higher layers (and all layers beneath). This avoids us getting stuck in local minima (which can happen if we land on a vertex with few neighbours). We navigate the graph by moving to the next nearest vertex, i.e. vector, based on euclidean distance. When we hit a minimum at a layer (i.e. no closer neighbour to move to), we drop down to the next layer - branching down on a configurable number of links, efSearch. We continue until we hit a local minimum at layer 0.
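A minimal sketch of this layer-by-layer greedy descent, assuming hypothetical `graph` and `vectors` structures (a real implementation keeps a candidate list of size efSearch at layer 0 rather than a single best vertex):

```python
import numpy as np

def greedy_search(graph, vectors, query, entry_point, top_layer):
    """Simplified HNSW descent: follow the closest neighbour at each layer.

    graph[layer][v] -> iterable of neighbour ids of vertex v at that layer (hypothetical)
    vectors[v]      -> the numpy vector stored at vertex v (hypothetical)
    """
    def dist(v):
        return np.linalg.norm(vectors[v] - query)   # euclidean distance

    current = entry_point
    for layer in range(top_layer, -1, -1):
        improved = True
        while improved:                              # greedy move to the nearest neighbour
            improved = False
            for neighbour in graph[layer][current]:
                if dist(neighbour) < dist(current):
                    current = neighbour
                    improved = True
        # local minimum at this layer -> drop to the layer below
    return current                                   # nearest vertex found at layer 0
```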


Building the graph is a little more complex - hence the hit on indexing performance. Roughly,

  1. Decide on a layer for the inbound vector. For this we use a probability function that exponentially decreases the chances of the vector being assigned to higher layers. Most vectors go to the bottom layer. L is the number of layers.
  2. If a vertex is assigned to layer N, it is assigned to layers 0 to N.
  3. When assigning to a layer we need to find candidates for the vertex's neighbours. The parameter M controls the number of neighbours each vertex has (at each layer). Finding candidates uses a standard search from the top layer - friends are added once it reaches its insertion layer. efConstruction is a parameter that can be set and controls how many candidates are considered as friends and followed when moving between the layers.
  4. It is also not uncommon to have Mmax and Mmax0, e.g. 3 and 5 respectively. As more vertexes are added, some existing vertexes can end up with more friends than M. Mmax (which might be higher than M) cannot be exceeded, and a vertex's links will be trimmed if this is the case. Mmax0 is the same but for layer 0 - usually higher than Mmax.

The probability of a vector being inserted at a given layer is given by a probability function. We normalize this by a parameter m_L known as the "level multiplier". If this is close to 0, all vectors are inserted at layer 0. The best performance is achieved when we minimize the overlap of shared neighbours across layers. Decreasing m_L can help minimize overlap (pushing more vectors to layer 0), but this increases the average number of traversals during the search. A rule of thumb for the optimal value is 1/ln(M).
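A sketch of that layer assignment under the 1/ln(M) rule of thumb (illustrative only):

```python
import math
import random

def random_level(M):
    """Draw the insertion layer for a new vector.

    With m_L = 1/ln(M) the layer distribution decays exponentially, so most
    vectors land on layer 0, ~1/M of them reach layer 1, ~1/M^2 layer 2, etc.
    """
    m_L = 1.0 / math.log(M)
    u = 1.0 - random.random()          # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_L)

levels = [random_level(16) for _ in range(100_000)]
print(max(levels), sum(l == 0 for l in levels) / len(levels))   # top layer, fraction at layer 0
```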

Parameters:

  1. M - number of neighbours
  2. Mmax - max number of neighbours when adding links, before being pruned, for layers > 0
  3. Mmax0 - max number of neighbours for layer 0
  4. efSearch - number of vertexes to follow when traversing layers
  5. efConstruction - same as efSearch but used when inserting vectors
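As a concrete illustration, these parameters map fairly directly onto the hnswlib Python bindings (assuming we used hnswlib; the numbers below are placeholders, not tuned values):

```python
import hnswlib
import numpy as np

dim, num_elements = 128, 10_000
data = np.random.random((num_elements, dim)).astype(np.float32)

index = hnswlib.Index(space='l2', dim=dim)            # euclidean distance
index.init_index(max_elements=num_elements,
                 M=16,                                 # neighbours per vertex
                 ef_construction=200)                  # candidates considered while building
index.add_items(data, np.arange(num_elements))

index.set_ef(64)                                       # efSearch: candidates followed at query time
labels, distances = index.knn_query(data[:5], k=10)
```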

Assessment

Using the criteria here:

  1. Are the results approximate or exact? If approximate, how accurate vs exhaustive search? Is it tuneable? If we want to tune, do we need to reindex?

Results are approximate by design. Most tests suggest recall is excellent and among the best quality of the algorithms compared - see here (figure 5) and here. Higher recall means poorer query performance - we would need to tune.

To quote

" When there are sufficient computing resources (both main memory and CPUs) for the off-line index construction, and sufficient main memory to hold the resulting index, DPG and HNSW are the best choices for ANNS on high dimensional data due to their outstanding search performance in terms of robustness to the datasets, result quality, search time and search scalability."

Can be tuned with efConstruction, efSearch and M - higher values mean more vertexes visited when moving between layers, at the cost of time and memory. If we use a higher efConstruction it takes longer to build the index, but it means we can use lower efSearch values at query time.
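One way to quantify this trade-off is to sweep efSearch against an exhaustive baseline and measure recall; a sketch using faiss (dataset sizes and parameter values are illustrative):

```python
import faiss
import numpy as np

d, nb, nq, k = 64, 50_000, 100, 10
xb = np.random.random((nb, d)).astype('float32')
xq = np.random.random((nq, d)).astype('float32')

exact = faiss.IndexFlatL2(d)                  # brute-force baseline
exact.add(xb)
_, ground_truth = exact.search(xq, k)

hnsw = faiss.IndexHNSWFlat(d, 32)             # M = 32
hnsw.hnsw.efConstruction = 200                # set before adding vectors
hnsw.add(xb)

for ef in (16, 64, 256):
    hnsw.hnsw.efSearch = ef
    _, approx = hnsw.search(xq, k)
    recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(approx, ground_truth)])
    print(f"efSearch={ef}: recall@{k}={recall:.3f}")
```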


  2. Do we need to perform special operations on the vectors, e.g. compression, for the algorithm to work?

No - any encoding is possible. We could use product quantisation with this, for example, to lower memory usage.

  3. Does the algorithm support high dimensionality and high document counts? Could we use sparse vectors or would we need to encode to lower dimensionality?

Yes, in the order of 100 dimensions - we would need to reduce vector dimensions, e.g. using BERT or product quantisation. It will not be effective on sparse vectors. Query performance remains strong for < 100 dimensions due to the small number of comparisons needed to find candidates. It will suffer from reduced index throughput as dimensionality increases, but this can be tuned. The index size shouldn't be too impacted by increased dimensionality - it's independent of this other than the size of the vectors themselves to be persisted.

Search performance is good (less than linear) as the number of documents increases. Indexing time and size are linear, however.

Some concerns here on high dimensionality that we need to check. Basically the hierarchical structure loses its advantages as d increases - we might need to do something to offset this, e.g. product quantisation. This also might not be the case, as HNSW benefits from low intrinsic dimensionality and could be improved by the diversification outlined in https://arxiv.org/pdf/1904.02077.pdf.

  4. Does the algorithm deliver low latency queries and high throughput on large datasets? How does dimensionality impact this? Can it be tuned?

Again, tuned with efConstruction, efSearch and M. Higher values mean slower performance. We would need to target some SLAs and tune for the corpus - balancing against recall. efConstruction is a good choice for improving recall without impacting search time or resources too heavily, at the expense of longer index construction.

  5. How does the algorithm perform for indexing performance - both throughput and final size? How does dimensionality impact this?

This appears to be inferior to hashing techniques and tree techniques like Annoy with respect to indexing throughput. Relative to all ANN algorithms it's a median performer - mainly because insertion is non-trivial.

Size is independent of dimensionality and in this respect, it compresses data well.

The algorithm biases search performance over index creation. Insertion basically has to do a search, which is a lot of work for adding a single vector.

More details here - https://arxiv.org/pdf/1610.02455.pdf - see 8.5.3

  6. Can datapoints be added, updated and removed? Or does the corpus need to be indexed in one go?

Yes, this algorithm allows incremental addition of points - incremental construction is inherent to the algorithm. Deletes would need some research.
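For example, if we used hnswlib, points are added incrementally by design and deletes are "soft" (marked points are hidden from queries but stay in the graph) - whether that is good enough for us is the part needing research:

```python
import hnswlib
import numpy as np

index = hnswlib.Index(space='l2', dim=128)
index.init_index(max_elements=100_000, M=16, ef_construction=200)

# batches can be added as documents arrive - no need to index the corpus in one go
index.add_items(np.random.random((1_000, 128)).astype(np.float32), np.arange(1_000))
index.add_items(np.random.random((500, 128)).astype(np.float32), np.arange(1_000, 1_500))

# soft delete: id 42 is excluded from query results but its vertex remains in the graph
index.mark_deleted(42)
```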

  7. Can queries be executed concurrently? Can querying occur whilst update operations are occurring? Do we need to support this?

Should be relatively easy to add, although pruning may make things a little more tricky. We may need to be careful about removing neighbours while concurrent searches are open.

  8. Memory requirements - do the indexed vectors need to fit entirely in RAM or are disk-based structures, e.g. with memory mapping, supported?

The default implementation uses a lot of memory. It is controllable through M - at the expense of query recall. We could use product quantisation with this to lower memory usage, e.g. an IVFPQ-style index - all at the expense of recall and search time.
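A sketch of the product quantisation route using faiss's IVFPQ index (note this is an IVF coarse index rather than the HNSW graph itself - shown only to illustrate the memory/recall trade-off; parameter values are placeholders):

```python
import faiss
import numpy as np

d, nb = 128, 100_000
xb = np.random.random((nb, d)).astype('float32')

nlist, m, nbits = 1024, 16, 8                  # 16 sub-quantisers x 8 bits = 16 bytes per vector
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)                                # PQ codebooks and coarse centroids need training
index.add(xb)

index.nprobe = 16                              # inverted lists probed per query (recall vs speed)
D, I = index.search(xb[:5], 10)
```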

  9. Can indexes be easily serialized to disk to save between restarts?

Yes, although lazily loading portions off disk - to avoid having to load the entire index into memory - could be challenging. We could hold the graph as ids in memory and then read the vectors from an offset file. Ensuring this was sensitive to OS caching would need some thought.
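Serialization itself is straightforward in the common libraries, e.g. faiss (the lazy, partial loading discussed above would still be our own design problem):

```python
import faiss
import numpy as np

d = 128
index = faiss.IndexHNSWFlat(d, 32)
index.add(np.random.random((10_000, d)).astype('float32'))

faiss.write_index(index, "hnsw.index")     # persists the graph and the vectors to disk
restored = faiss.read_index("hnsw.index")  # loads the whole index back into memory
```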

  10. How does the algorithm deal with varying vector dimensions, e.g. a query might have N dimensions and text documents various - or do all vectors have to be the same size? If so, how do we handle it?

Vectors are compared on euclidean distance by default, so queries and documents need to be encoded to the same dimensionality - with a consistent encoding this shouldn't be an issue. To encode our vectors we might need a vocabulary - an inverted index on disk. High dimensionality will impact memory heavily, so I think we'll need to compress our vectors somehow - either using maps and/or quantisation.
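For example, if we encoded queries and documents with the same transformer model (sentence-transformers is one option, not a decision), every vector comes out at the same fixed dimensionality:

```python
from sentence_transformers import SentenceTransformer

# 'all-MiniLM-L6-v2' is just an example model; it produces 384-dimensional vectors
model = SentenceTransformer('all-MiniLM-L6-v2')

doc_vectors = model.encode(["first document text", "second document text"])
query_vector = model.encode("example query")

print(doc_vectors.shape)   # (2, 384) - documents and queries share the same dimensionality
```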

  11. Do the algorithms rely on special hardware, e.g. GPUs, or is commodity hardware fine? (I think we prefer the latter)

Works fine on commodity hardware. No need for GPUs.

  12. Does the algorithm use a custom distance metric or just euclidean distance?

We could use any distance metric we wanted. Euclidean is the default.
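For instance, hnswlib exposes a few built-in metrics via its space argument (cosine would likely suit normalized text embeddings):

```python
import hnswlib

# built-in hnswlib spaces: 'l2' (euclidean), 'ip' (inner product), 'cosine'
index = hnswlib.Index(space='cosine', dim=384)
index.init_index(max_elements=10_000, M=16, ef_construction=200)
```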

  13. General suitability for text retrieval.

I think it's a good fit. Vespa chose it because it works well with query filters - https://medium.com/vespa/approximate-nearest-neighbor-search-in-vespa-part-1-36be5f43b8d1

  14. Any good example implementations?

Lots - https://github.com/facebookresearch/faiss is probably the most academic.