opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
153 stars 113 forks source link

[FEATURE] [META] Introducing High Cardinality Filters or Smart Filters in Vector Search #1140

Open navneet1v opened 1 year ago

navneet1v commented 1 year ago

Introduction

This issue introduces the concept of High Cardinality Filters(HCF) in Vector Search Space and explore the possible solutions on how it can be implement in the current k-NN plugin architecture. The issue will take a shot on setting up the problem statement what is needed to be solved and how it will benefit the users of Vector Search.

Problem Statement

Filtered Vector Search in the current form works well with the changes introduced with OpenSearch 2.10 version of OpenSearch. But this whole approach of getting filtered Ids first and then doing ANN Search while traversing the graph leads to hitting the nodes which will never be part of final results and can leads bad recall if nodes in ground truth are connected with these wrong nodes. This problem will become more visible when the unique values of filter filed is in order of 1000s or 10000s in whole corpus and we have billions of documents in corpus.

Use Case

Lets try to understand with a usecase, consider a company which stores data from which is extracted from different sources like wikipedia, reddit etc. on various topics like networking, AI, LLM etc and index them in OpenSearch cluster. Now there are users which have queries like “What is a TCP protocol” in “networking”. One way to do this is build a query like this:

// Semantic Search
{
   "query": {
        "neural" : {
            "query_string": "What is a TCP protocol",
            "model_id": "someModelId",
            "filter": [
                {
                    "term": {"topic": "networking"}
                }
            ]
        }
    }
}

// which gets converted to K-NN Query Like
{
   "query": {
        "k-nn" : {
            "data_embedding": {
                "vec": [....], // embeddings for the query string
                "filter": [
                    {
                        "term": {"topic": "networking"}
                    }
                ]
            }
        }
    }
}

In all of these cases what is will happen is we will first go ahead and get the filtered document Ids from the lucene inverted index and then we will go ahead and do the ANN search with those filtered ids. This all seems good, but the problem is we are doing a search on larger graph and the graph nodes specific to a filter may be sparsed/distributed in the whole graph recall will drop and latency will shoot up.

Another solution for the whole problem is to build indices per data source, and then at application layer deicide which index to hit the query. This solution can work if you have limited number of data sources. But as data source grow to lets say in order of 1000s or 10000s, we will start to hit the problem of too many indices or may need to create a separate cluster for more indices.

Another solution for this problem can be using a _routing key to route the data for a specific data source to a specific shard. But this solution will suffer from problems like hot shards and scaling the whole cluster because of 1 single shard is getting more queries.

There can be many such usecase where there is some top level field across which data is not shared and queries are most of the time is contains filtering on that field along with more filters

High Level Proposed Solution

The solution which I am proposing here is to introduce some kind of Smart Filters/High Cardinality Filters, which at the indexing time make sure to group the similar documents together at the segment level and then create separate HNSW graphs for all those groups of documents at the segment level. Similarly during the Search, we will look at the filter in that query(along with multiple others) and then choose the appropriate graph to search.

What are High Cardinality Filters(HCF) or Smart Filters?

There is no formal definition of HCF in OpenSearch world(may be there are some literature in open but we haven’t looked at that), but it’s more of a concept that is being introduced with this issue. But I tried to summarize it below:

A high cardinality Filters are the filters that are applied to the fields which has a significant number of unique values in a very large corpus, and as a results when applied during the search consumes significant resources.

Next Steps

Will add more as we start making progress on this issue.

jmazanec15 commented 1 year ago

Really cool feature. Another important use case might be the per-user similarity search.

navneet1v commented 1 year ago

Really cool feature. Another important use case might be the per-user similarity search.

Thanks Jack for suggesting another use case.