run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Feature Request]: Opensearch efficient filtering #14433

Open spreeni opened 6 days ago

spreeni commented 6 days ago

Feature Description

OpenSearch implements "efficient filters", which apply filtering dynamically, in an iterative fashion, during the kNN search itself rather than only before or after it (see https://opensearch.org/blog/efficient-filters-in-knn/). I think it would be nice if LlamaIndex used this search mode by default whenever a supported engine is configured (the compatibility table in the OpenSearch documentation lists which engines qualify). At the time of writing, efficient filtering is supported for the Lucene engine (HNSW) and the Faiss engine (HNSW and IVF).
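
For illustration, an efficient-filtering query nests the filter inside the knn clause itself, so the engine can apply it while running the approximate search. A minimal sketch of such a request body (index, field names, and the filter are placeholders):

query_body = {
    "size": 10,
    "query": {
        "knn": {
            "embedding": {
                "vector": [0.1, 0.2, 0.3],  # query embedding
                "k": 10,
                # the filter sits INSIDE the knn clause, which is what
                # triggers efficient filtering on supported engines
                "filter": {
                    "bool": {"must": [{"term": {"genre": "fantasy"}}]}
                },
            }
        }
    },
}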

Reason

In the current implementation, a "painless scripting" query is used whenever filters are present. If I understand it correctly, this rules out ANN search and should therefore scale much worse on large databases, since a score has to be computed for every document that matches the filter.

As described in the docstring of llama_index.vector_stores.opensearch.OpensearchVectorClient._knn_search_query():

If there are no filters do approx-knn search.
If there are (pre)-filters, do an exhaustive exact knn search using 'painless scripting'.

See the corresponding implementation in llama_index.vector_stores.opensearch.OpensearchVectorClient as well:

def __get_painless_scripting_source(
    self, space_type: str, vector_field: str = "embedding"
) -> str:
    """For Painless Scripting, it returns the script source based on space type."""
    source_value = (
        f"(1.0 + {space_type}(params.query_value, doc['{vector_field}']))"
    )
    if space_type == "cosineSimilarity":
        return source_value
    else:
        return f"1/{source_value}"

def _default_painless_scripting_query(
    self,
    query_vector: List[float],
    k: int = 4,
    space_type: str = "l2Squared",
    pre_filter: Optional[Union[Dict, List]] = None,
    vector_field: str = "embedding",
) -> Dict:
    """For Painless Scripting Search, this is the default query."""
    if not pre_filter:
        pre_filter = MATCH_ALL_QUERY

    source = self.__get_painless_scripting_source(space_type, vector_field)
    return {
        "size": k,
        "query": {
            "script_score": {
                "query": pre_filter,
                "script": {
                    "source": source,
                    "params": {
                        "field": vector_field,
                        "query_value": query_vector,
                    },
                },
            }
        },
    }
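
A minimal sketch of what an efficient-filtering counterpart to this painless-scripting path could look like (a hypothetical helper, not actual LlamaIndex code; it assumes the index uses a supported engine such as Lucene HNSW or Faiss HNSW/IVF):

from typing import Dict, List, Optional

def _efficient_filter_knn_query(
    query_vector: List[float],
    k: int = 4,
    filter_query: Optional[Dict] = None,
    vector_field: str = "embedding",
) -> Dict:
    """Hypothetical: build an approx-knn query with the filter nested
    inside the knn clause, so supported engines apply it during the
    ANN search instead of falling back to exact scoring."""
    knn_body: Dict = {"vector": query_vector, "k": k}
    if filter_query:
        # Unlike the painless-scripting path, the filter lives inside
        # the knn clause, which enables OpenSearch efficient filtering.
        knn_body["filter"] = filter_query
    return {"size": k, "query": {"knn": {vector_field: knn_body}}}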

Value of Feature

If I understood this correctly, the current implementation makes vector queries with filters infeasible on large datasets. The only alternative in the current approach would be post-filtering, which gives no guarantee on the number of returned results.

I really like LlamaIndex and would be happy to see this implemented here. :) I am not an OpenSearch expert, but if I have time I could try to implement it myself and create a PR.

spreeni commented 6 days ago

I had a look at Haystack's implementation of their OpenSearchDocumentStore, and they implement it as follows. If I see this correctly, though, it does simple post-search filtering, which is not ideal, although they state that

Filters are applied during the approximate kNN search to ensure that top_k matching documents are returned.

But according to the OpenSearch blog post I linked above, shouldn't the filter go inside the knn query itself?

body = {
    "query": {
        "bool": {
            "must": [
                {
                    "knn": {
                        "embedding": {
                            "vector": query_embedding,
                            "k": top_k,
                        }
                    }
                }
            ],
        }
    },
}

if filters:
    body["query"]["bool"]["filter"] = normalize_filters(filters)

body["size"] = top_k