run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: LlamaIndex with OpenSearch: MetadataFilters not working as expected #15433

Closed Blandrust closed 2 months ago

Blandrust commented 2 months ago

Question

I'm using LlamaIndex as a knowledge base with OpenSearch as the vector store. Everything works fine until I try to add MetadataFilters to my query. Here's my current query function:

from llama_index.core.vector_stores import MetadataFilter, MetadataFilters
from llama_index.core.vector_stores.types import VectorStoreQueryMode

def query_index(query: str, indexName: str):
    index = vectorstore_cache.load_index(indexName)
    category = select_category(query, indexName)

    filters = MetadataFilters(filters=[
        MetadataFilter(
            key="category",
            value=category.name,
        ),
    ])
    retriever = index.as_retriever(
        choice_batch_size=5,
        filters=filters,
        vector_store_query_mode=VectorStoreQueryMode.HYBRID,
    )
    result = retriever.retrieve(query)
    return result

Without filters, the function retrieves documents as expected. However, when I add the filters, I get this error:

RequestError: RequestError(400, 'x_content_parse_exception', '[category] query malformed, no start_object after query name')

I logged the query sent to OpenSearch, and it looks something like this:

{
    'size': 2, 
    'query': {
        'hybrid': {
            'queries': [
                {
                    'bool': {
                        'must': {'match': {'content': {'query': 'x?'}}},
                        'filter': [{'category': 'y'}]
                    }
                },
                {
                    'script_score': {
                        'query': {'bool': {'filter': [{'category': 'y'}]}},
                        'script': {
                            'source': "1/(1.0 + l2Squared(params.query_value, doc['embedding']))",
                            'params': {'field': 'embedding', 'query_value': [-0.026000717654824257, ...]}
                        }
                    }
                }
            ]
        }
    }
}

I believe the filter should look more like this:

"term": {
    "category": "y"
}

Is there a way to modify how LlamaIndex constructs the filter query for OpenSearch, or am I misunderstanding how to use MetadataFilters with OpenSearch? I'm not sure whether this is a LlamaIndex issue or something missing in my OpenSearch configuration. Any help or guidance would be greatly appreciated. Thank you in advance!
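To illustrate what I mean, here is a small plain-Python sketch (the helper name is mine, not LlamaIndex API) of the clause shape I believe OpenSearch expects versus what is being sent:

```python
# Hypothetical helper, for illustration only (not part of LlamaIndex):
# wrap a metadata key/value pair in the `term` clause OpenSearch expects.
def build_term_filter(key: str, value: str) -> dict:
    return {"term": {key: value}}

sent_by_llamaindex = {"category": "y"}            # from my logged query
expected_by_opensearch = build_term_filter("category", "y")
print(expected_by_opensearch)  # {'term': {'category': 'y'}}
```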

dosubot[bot] commented 2 months ago

Hey there, @Blandrust! I'm here to help you with any bugs, questions, or contributions. Let's tackle this issue together while we wait for a human maintainer.

Yes, there is a way to modify how LlamaIndex constructs the filter query for OpenSearch. The _parse_filter and _parse_filters_recursively methods in the OpensearchVectorClient class are responsible for converting MetadataFilters into OpenSearch queries. You can modify these methods to change how the filter queries are constructed.

Here is the relevant code snippet from the OpensearchVectorClient class:

def _parse_filter(self, filter: MetadataFilter) -> dict:
    """Parse a single MetadataFilter to equivalent OpenSearch expression.

    As Opensearch does not differentiate between scalar/array keyword fields, IN and ANY are equivalent.
    """
    key = f"metadata.{filter.key}"
    op = filter.operator

    equality_postfix = ".keyword" if self._is_text_field(value=filter.value) else ""
    if op == FilterOperator.EQ:
        return {"term": {f"{key}{equality_postfix}": filter.value}}
    elif op in [
        FilterOperator.GT,
        FilterOperator.GTE,
        FilterOperator.LT,
        FilterOperator.LTE,
    ]:
        return {"range": {key: {filter.operator.name.lower(): filter.value}}}
    elif op == FilterOperator.NE:
        return {
            "bool": {
                "must_not": {"term": {f"{key}{equality_postfix}": filter.value}}
            }
        }
    elif op in [FilterOperator.IN, FilterOperator.ANY]:
        return {"terms": {key: filter.value}}
    elif op == FilterOperator.NIN:
        return {"bool": {"must_not": {"terms": {key: filter.value}}}}
    elif op == FilterOperator.ALL:
        return {
            "terms_set": {
                key: {
                    "terms": filter.value,
                    "minimum_should_match_script": {"source": "params.num_terms"},
                }
            }
        }
    elif op == FilterOperator.TEXT_MATCH:
        return {"match": {key: {"query": filter.value, "fuzziness": "AUTO"}}}
    elif op == FilterOperator.CONTAINS:
        return {"wildcard": {key: f"*{filter.value}*"}}
    else:
        raise ValueError(f"Unsupported filter operator: {filter.operator}")

def _parse_filters_recursively(self, filters: MetadataFilters) -> dict:
    """Parse (possibly nested) MetadataFilters to equivalent OpenSearch expression."""
    condition_map = {FilterCondition.AND: "must", FilterCondition.OR: "should"}

    bool_clause = condition_map[filters.condition]
    bool_query: dict[str, dict[str, list[dict]]] = {"bool": {bool_clause: []}}

    for filter in filters.filters:
        if isinstance(filter, MetadataFilters):
            bool_query["bool"][bool_clause].append(self._parse_filters_recursively(filter))
        else:
            bool_query["bool"][bool_clause].append(self._parse_filter(filter))

    return bool_query

You can customize these methods to fit your specific requirements for constructing filter queries in OpenSearch.
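For example, the EQ branch above can be exercised standalone to see the clause shape it produces. This is a simplified copy of that branch for illustration, not the actual client method, and it assumes string values are matched against a ".keyword" subfield:

```python
# Simplified standalone copy of the EQ branch of _parse_filter above,
# assuming string values are matched against the ".keyword" subfield.
def parse_eq_filter(key, value) -> dict:
    field = f"metadata.{key}"
    postfix = ".keyword" if isinstance(value, str) else ""
    return {"term": {f"{field}{postfix}": value}}

print(parse_eq_filter("category", "y"))
# {'term': {'metadata.category.keyword': 'y'}}
```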

Additionally, the LlamaIndex documentation provides examples of how to use the ElasticsearchStore class and apply metadata filters. For instance, you can use the ExactMatchFilter and MetadataFilters classes to filter documents based on metadata:

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
from llama_index.vector_stores.elasticsearch import ElasticsearchStore

metadata_store = ElasticsearchStore(
    es_url="http://localhost:9200",  # for Elastic Cloud authentication see above
    index_name="movies_metadata",
)
storage_context = StorageContext.from_defaults(vector_store=metadata_store)
index = VectorStoreIndex(movies, storage_context=storage_context)
# Metadata filter
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="theme", value="Mafia")]
)
retriever = index.as_retriever(filters=filters)

results = retriever.retrieve("What is inception about?")
print_results(results)

For more advanced custom filters, you can override the query by passing a custom query function:

def custom_query(query, query_str):
    print("custom query", query)
    return query

query_engine = index.as_query_engine(
    vector_store_kwargs={
        "es_filter": [{"match": {"title": "matrix"}}],
        "custom_query": custom_query,
    }
)
query_engine.query("what is this movie about?")

This allows you to leverage the full range of filtering capabilities provided by Elasticsearch.

Moreover, the test file test_opensearch_client.py in the LlamaIndex repository provides practical examples of how to use these filters in various scenarios, including nested filters and different filter operators. This can serve as a useful reference for understanding and customizing the filter query construction process [1][2][3].

To continue talking to Dosu, mention @dosu.

sad-mathematician commented 2 months ago

@Blandrust There was a bug in how llama_index handled metadata filtering with OpenSearch. I believe it was fixed last week, so upgrading to the latest llama_index version should solve your problem. If you don't want to upgrade right now, you will need to patch the _parse_filters method of the OpensearchVectorClient class, where the problem is. The filter should look like:

'filter': [{'term': {'metadata.category': 'y'}}]

It needs a term clause wrapping the value, with metadata. added as a prefix to the filter key.

import json
from typing import Any, Optional

from llama_index.core.vector_stores import MetadataFilters
from llama_index.vector_stores.opensearch import OpensearchVectorStore, OpensearchVectorClient

# Replacement _parse_filters method that wraps each value in a
# term/terms clause, fixing the malformed filter the default produces
def new_parse_filters(self, filters: Optional[MetadataFilters]) -> Any:
    pre_filter = []
    if filters is not None:
        for f in filters.legacy_filters():
            if isinstance(f.value, list):
                pre_filter.append({"terms": {f.key: f.value}})
            elif isinstance(f.value, str):
                pre_filter.append({"term": {f.key: f.value}})
            else:
                pre_filter.append({f.key: json.loads(str(f.value))})

    return pre_filter

# Monkey-patch the client so all instances use the fixed method
OpensearchVectorClient._parse_filters = new_parse_filters

Adding this will replace the _parse_filters method and format the filters properly for now. Remember, before the recent fix, the key in MetadataFilter had to be metadata. followed by the field name, so metadata.category in your case. You don't have to do any of this if you just upgrade the llama_index version.
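To check the output shape without touching OpenSearch, here is a standalone mirror of the same parsing logic. It takes plain (key, value) pairs instead of MetadataFilter objects so it runs in isolation; it is a sketch of the patch's behavior, not LlamaIndex code:

```python
import json

# Standalone mirror of the patched _parse_filters logic, taking
# (key, value) pairs so it can run without an OpensearchVectorClient.
def format_filters(pairs):
    pre_filter = []
    for key, value in pairs:
        if isinstance(value, list):
            pre_filter.append({"terms": {key: value}})
        elif isinstance(value, str):
            pre_filter.append({"term": {key: value}})
        else:
            pre_filter.append({key: json.loads(str(value))})
    return pre_filter

print(format_filters([("metadata.category", "y")]))
# [{'term': {'metadata.category': 'y'}}]
```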

note: the answer given by dosu bot is based on the latest version, where the metadata filtering has been fixed, so it might not be relevant to your question about a previous version. Upgrading would be the best way to solve your problem.

logan-markewich commented 2 months ago

This is solved if you update: `pip install -U llama-index-vector-stores-opensearch`