[RFC] Implement pruning for neural sparse search

zhichao-aws commented 5 days ago

Background

Neural Sparse is a semantic search method which is built on native Lucene inverted index. The documents and queries are encoded into sparse vectors, where the entry represents the token and their corresponding semantic weight.

Since the model expands the tokens with semantic weights during the encoding process, the number of tokens in the sparse vectors is often greater than the original raw text. Additionally, the token weights in the sparse vectors exhibit a significant long-tail distribution, where tokens with lower semantic importance occupy a large portion of the storage space. In the experiments of this blog, we found that the index sizes produced by the two modes of neural sparse search are 4.7 times and 6.8 times larger than the BM25 inverted index.

Pruning can effectively alleviate this problem. During the process of ingestion and search, we prune the sparse vectors according to different strategies, removing tokens with relatively small weights. Research has shown that even simple pruning strategies can significantly reduce index size while preserving most of the search accuracy[1]. This can help users achieve a better balance between search accuracy and cost.

What are we going to do?

Implement pruning at sparse_encoding ingestion processor. Users can configure the pruning strategy when create the processor, and the processor will prune the sparse vectors before write to index.
Implement pruning at neural_sparse query clause. Users can configure the pruning strategy when search with neural_sparse query. The query builder will prune the query before search on index.
Give a quantitative analysis of the pruning strategies. Users can refer to the analysis results to configure the pruning strategy, thereby achieving the desired balance between accuracy and cost.

Pruning strategy

We propose to implent these 4 pruning strategies:

Pruning by weight threshold

For this method, given a threshold T, all tokens whose weight is smaller than T will be pruned.

def prune_by_weight_threshold(sparse_vector, threshold):
    pruned_vector = {}
    for token, weight in sparse_vector.items():
        if weight >= threshold:
            pruned_vector[token] = weight
    return pruned_vector

Pruning by ratio with max weight

For this method, given a sparse vector X, we first find the max weight of X, and calculate the weight ratio of every token and the max weight. If the ratio is smaller than threshold T, it will be pruned.

def prune_by_ratio_with_max_weight(sparse_vector, threshold):
    pruned_vector = {}
    max_weight = max(sparse_vector.values())
    for token, weight in sparse_vector.items():
        ratio = weight / max_weight
        if ratio >= threshold:
            pruned_vector[token] = weight
    return pruned_vector

Pruning by Top-K

For this method, given a sparse vector S, we first sort the tokens based on their weights, from large to small. And we only keep the tokens with Top-K values.

def prune_by_top_k(sparse_vector, k):
    sorted_vector = sorted(sparse_vector.items(), key=lambda x: x[1], reverse=True)
    pruned_vector = dict(sorted_vector[:k])
    return pruned_vector

Pruning by alpha-mass[2]

For this method, given a sparse vector S, we first sort the tokens based on their weights, from large to small. We iterate on the vector entries and record the accumulated values, until the ratio of accumulated values and sum of all values is larger than threshold T. And the non-iterated entries are dropped.

def prune_by_alpha_mass(sparse_vector, threshold):
    sorted_vector = sorted(sparse_vector.items(), key=lambda x: x[1], reverse=True)
    pruned_vector = {}
    accumulated_mass = 0
    total_mass = sum(sparse_vector.values())
    for token, weight in sorted_vector:
        accumulated_mass += weight
        if accumulated_mass / total_mass >= threshold:
            break
        pruned_vector[token] = weight
    return pruned_vector

API

To create an ingest processor with pruning:

PUT /_ingest/pipeline/sparse-pipeline
{
    "description": "Calling sparse model to generate expanded tokens",
    "processors": [
        {
            "sparse_encoding": {
                "model_id": "fousVokBjnSupmOha8aN",
                "field_map": {
                    "body": "body_sparse"
                },
                "pruning_config":{
                    "pruning_type": "alpha_mass",
                    "threshold": 0.8
                }
            }
        }
    ]
}

To search with pruning:

GET /test-index/_search
{
    "query": {
        "neural_sparse": {
            "body_sparse": {
                "query_text": "i be going",
                "model_id": "fousVokBjnSupmOha8aN",
                "pruning_config":{
                    "pruning_type": "alpha_mass",
                    "threshold": 0.8
                }
            }
        }
    }
}

References

[1]: A Static Pruning Study on Sparse Neural Retrievers

[2]: Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations

zhichao-aws commented 5 days ago

To make the ingest processor work for raw sparse vectors ingestion, one prerequiste is to implement this issue: #793 . We can add a parameter to configure whether call model inference for raw vectors ingestion

zhichao-aws commented 5 days ago

Another tricky part is the combination with 2-phase search. My thought on the proper behavior is we first prune, then split the queries to two phase. Please feel free to put more comments about this.

vibrantvarun commented 5 days ago

@zhichao-aws Can you add some context on what is pruning?

zhichao-aws commented 1 day ago

Can you add some context on what is pruning?

In the context of sparse vector representations, pruning is a technique used to reduce the size of the sparse vectors by removing or "pruning" tokens that have relatively low semantic weights or importance.

In neural sparse search, documents and queries are encoded into sparse vectors, where each entry represents a token and its corresponding semantic weight. For example: "hello world" -> {"hello": 1.1, "world":1.2, "hi": 0.9, "planet": 0.1, "greeting": 0.5, "earth":0.15} (just for example, not real encoding result). For pruning purpose, we can remove planet and earth, to reduce the storage and increase search speed.

By applying pruning strategies, users can achieve a balance between search accuracy and storage costs, as research has shown that even simple pruning strategies can significantly reduce index size while preserving most of the search accuracy.

opensearch-project / neural-search