run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
35.5k stars 5.01k forks source link

[Bug]: Opensearch Serverless throwing compile error when trying to filter using MetadataFilter object #15560

Closed GeorgeDittmar closed 1 week ago

GeorgeDittmar commented 4 weeks ago

Bug Description

Hello I am running into an odd situation where when I try to connect up to opensearch serverless in AWS and attempt to filter on some metadata, in my case a field called index_type, it returns the following error RequestError: RequestError(400, 'search_phase_execution_exception', 'compile error')

I spent some time pulling out the OpensearchVectorClient and OpensearchVectorStore classes since I didnt see a good way to extract the query it was producing, but once I did and logged the query it generated this is what it attempts to build which is breaking. I also ran this query directly against opensearch-py library and got the same error as above so I am guessing the query being generated is incorrect some how. If I just do a basic "scan" query and filter on the metadata.index_type.keyword field it returns fine but the llamaindex generated query seems off.

Here is the query it generates before it attempts to query Opensearch serverless under the hood. I removed the values in the embedding field to save on screen real estate. the following query is the one that creates that compile error.

{'size': 6, 'query': {'script_score': {'query': {'bool': {'filter': [{'bool': {'must': [{'term': {'metadata.index_type.keyword': 'smart_chunk'}}]}}]}}, 'script': {'source': "1/(1.0 + l2Squared(params.query_value, doc['embedding']))", 'lang': 'painless', 'params': {'field': 'embedding', 'query_value': []}}}}}

But this gets weirder when I modify the Metadatafilter key values to be metadata.index_type instead of just index_type it produces the following query which then runs fine but doesnt return any results, due to it not finding anything to match that filter I assume.

{'size': 6, 'query': {'script_score': {'query': {'bool': {'filter': [{'bool': {'must': [{'term': {' metadata.metadata.index_type.keyword': 'smart_chunk'}}]}}]}}, 'script': {'source': "1/(1.0 + l2Squared(params.query_value, doc['embedding']))", 'lang': 'painless', 'params': {'field': 'embedding', 'query_value': []}}}}}

I am at a loss as to best sort out where in the query it is wrong, need to get better at understanding OS query language, or if its something about the index? we used llamaindex to create the index so I would have assumed that was taken care of correctly under the hood there. Any thoughts or ideas? Also why is the OpensearchClient Async only? I cant seem to be able to run the client under a fastapi service layer due to event loop issues, but I can deal with that later.

Version

0.10.68

Steps to Reproduce

Here is the code that tries to call the vector store and retrieve results from opensearch

vector_store = OpensearchVectorStore(client=self.vector_client)
index = VectorStoreIndex.from_vector_store(vector_store)
smart_chunk_filters = MetadataFilters(
    filters=[
        MetadataFilter(key="index_type", value="smart_chunk"),
    ]
)
query_engine = index.as_retriever(similarity_top_k=k, filters=smart_chunk_filters)
node_postprocessors = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)
result = query_engine.retrieve(query)

Relevant Logs/Tracbacks

backend-1  | opensearchpy.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'compile error')
dosubot[bot] commented 4 weeks ago

The compile error you're encountering when using the MetadataFilter object in Opensearch Serverless is likely due to incorrect usage or missing imports. Here is the correct way to use the MetadataFilter and MetadataFilters objects:

from llama_index.core.vector_stores.types import MetadataFilter, MetadataFilters, FilterOperator

# Example usage of MetadataFilter
filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="index_type",
            value="smart_chunk",
            operator=FilterOperator.EQ,
        )
    ]
)

# Create a query with the filters
query = VectorStoreQuery(
    query_embedding=[0.1, 0.2, 0.3],  # Example embedding
    similarity_top_k=10,
    filters=filters
)

# Assuming os_store is an instance of OpensearchVectorStore
query_result = os_store.query(query)

Ensure you have the necessary imports and that the MetadataFilter and MetadataFilters objects are correctly instantiated and used in your query. Additionally, make sure you are using the latest version of the llama-index-vector-stores-opensearch package, as recent updates have added support for full MetadataFilters (all operators and nested filters) and removed the necessity to prefix filter keys with "metadata" [1][2].

If you follow the above example and still encounter issues, please provide more specific details about the compile error message. Additionally, ensure that your OpensearchVectorClient and OpensearchVectorStore are correctly set up as shown in the documentation:

from llama_index.vector_stores.opensearch import (
    OpensearchVectorStore,
    OpensearchVectorClient,
)

# http endpoint for your cluster (opensearch required for vector index usage)
endpoint = "http://localhost:9200"
# index to demonstrate the VectorStore impl
idx = "gpt-index-demo"

# OpensearchVectorClient stores text in this field by default
text_field = "content"
# OpensearchVectorClient stores embeddings in this field by default
embedding_field = "embedding"

# OpensearchVectorClient encapsulates logic for a
# single opensearch index with vector search enabled
client = OpensearchVectorClient(
    endpoint, idx, 1536, embedding_field=embedding_field, text_field=text_field
)

# initialize vector store
vector_store = OpensearchVectorStore(client)

This setup ensures that your Opensearch index is correctly configured and ready to handle vector queries with metadata filters [1][3].

For a more detailed example, you can refer to the Alibaba Cloud OpenSearch Vector Store documentation, which includes steps for setting up and querying with metadata filters [4].

To continue talking to Dosu, mention @dosu.

GeorgeDittmar commented 4 weeks ago

@dosu I still get the same error when adding the operator field. opensearchpy.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'compile error')

dosubot[bot] commented 4 weeks ago

The error opensearchpy.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'compile error') is likely due to an issue in the query construction, specifically in the MetadataFilter parsing logic. Even though you have added the operator field to the MetadataFilter, the error suggests that there might be a problem with how the filter is being translated into an OpenSearch query.

In the provided code, the _parse_filter method is responsible for converting a MetadataFilter into an OpenSearch query expression. Here is the relevant part of the code:

def _parse_filter(self, filter: MetadataFilter) -> dict:
    """Parse a single MetadataFilter to equivalent OpenSearch expression.

    As Opensearch does not differentiate between scalar/array keyword fields, IN and ANY are equivalent.
    """
    key = f"metadata.{filter.key}"
    op = filter.operator

    equality_postfix = ".keyword" if self._is_text_field(value=filter.value) else ""
    if op == FilterOperator.EQ:
        return {"term": {f"{key}{equality_postfix}": filter.value}}
    elif op in [
        FilterOperator.GT,
        FilterOperator.GTE,
        FilterOperator.LT,
        FilterOperator.LTE,
    ]:
        return {"range": {key: {filter.operator.name.lower(): filter.value}}}
    elif op == FilterOperator.NE:
        return {
            "bool": {
                "must_not": {"term": {f"{key}{equality_postfix}": filter.value}}
            }
        }
    elif op in [FilterOperator.IN, FilterOperator.ANY]:
        return {"terms": {key: filter.value}}
    elif op == FilterOperator.NIN:
        return {"bool": {"must_not": {"terms": {key: filter.value}}}}
    elif op == FilterOperator.ALL:
        return {
            "terms_set": {
                key: {
                    "terms": filter.value,
                    "minimum_should_match_script": {"source": "params.num_terms"},
                }
            }
        }
    elif op == FilterOperator.TEXT_MATCH:
        return {"match": {key: {"query": filter.value, "fuzziness": "AUTO"}}}
    elif op == FilterOperator.CONTAINS:
        return {"wildcard": {key: f"*{filter.value}*"}}
    else:
        raise ValueError(f"Unsupported filter operator: {filter.operator}")

Ensure that the operator field in your MetadataFilter is correctly set and that the value is one of the supported FilterOperator values. If the operator is not recognized, it will raise a ValueError, which might be causing the compile error in the query.

Additionally, verify that the value field in the MetadataFilter is correctly formatted and appropriate for the specified operator. For example, if you are using a range operator (e.g., GT, GTE), ensure that the value is a number.

If the issue persists, you might want to add logging or print statements to debug the exact query being generated and sent to OpenSearch. This can help identify any discrepancies or formatting issues in the query.

For further reference, you can look at how the LlamaIndex framework handles similar filter transformations for Couchbase. The _transform_couchbase_filter_operator function converts standard metadata filter operators to Couchbase-specific filter operations, ensuring that the operators and values are correctly formatted for the target database:

def _transform_couchbase_filter_operator(
    operator: str, field: str, value: Any
) -> Dict[str, Any]:
    """
    Convert standard metadata filter operator to Couchbase specific filter operation.

    Args:
        operator: standard metadata filter operator
        field: metadata field
        value: value to apply for the filter

    Returns:
        Dictionary with Couchbase specific search operation.
    """
    if operator == "!=":
        return {"must_not": {"disjuncts": [{"field": field, "match": value}]}}
    elif operator == "==":
        return {"field": field, "match": value}
    elif operator == ">":
        return {"min": value, "inclusive_min": False, "field": field}
    elif operator == "<":
        return {"max": value, "inclusive_max": False, "field": field}
    elif operator == ">=":
        return {"min": value, "inclusive_min": True, "field": field}
    elif operator == "<=":
        return {"max": value, "inclusive_max": True, "field": field}
    elif operator == "text_match":
        return {"match_phrase": value, "field": field}
    else:
        raise ValueError(f"Filter operator {operator} not supported")

This approach ensures that the filter operators and values are correctly interpreted by the target database, reducing the likelihood of compile errors [1][2].

To continue talking to Dosu, mention @dosu.

GeorgeDittmar commented 4 weeks ago

@dosu here is the AWS Opensearch query from the logs I have I think it is trying to build and throwing the error with. I shortened the query_value field to just a few floats for brevity.

{'size': 6, 'query': {'script_score': {'query': {'bool': {'filter': [{'bool': {'must': [{'term': {'metadata.index_type.keyword': 'smart_chunk'}}]}}]}}, 'script': {'source': "1/(1.0 + l2Squared(params.query_value, doc['embedding']))", 'params': {'field': 'embedding', 'query_value': [-0.11328125, -0.072753906, 0.36523438]}}}}}
dosubot[bot] commented 4 weeks ago

The compile error in your AWS OpenSearch query is due to the incorrect usage of the l2Squared function in the script. The correct syntax for using the l2Squared function in a Painless script should be:

"1 / (1.0 + l2Squared(params.query_value, doc['embedding']))"

Here is the corrected query:

{
  "size": 6,
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": [
            {
              "bool": {
                "must": [
                  {
                    "term": {
                      "metadata.index_type.keyword": "smart_chunk"
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "script": {
        "source": "1 / (1.0 + l2Squared(params.query_value, doc['embedding']))",
        "params": {
          "field": "embedding",
          "query_value": [-0.11328125, -0.072753906, 0.36523438]
        }
      }
    }
  }
}

This should resolve the compile error [1].

To continue talking to Dosu, mention @dosu.

GeorgeDittmar commented 4 weeks ago

@dosu I do not think that is it. the query you produced is the same as the one I posted just with a few extra white spaces. I tried to copy it in my query just to be sure but still errored. here is the query I tried.

{'size': 6, 'query': {'script_score': {'query': {'bool': {'filter': [{'bool': {'must': [{'term': {'metadata.index_type.keyword': 'smart_chunk'}}]}}]}}, 'script': {'source': "1 / (1.0 + l2Squared(params.query_value, doc['embedding']))", 'params': {'field': 'embedding', 'query_value': [-0.11328125, -0.072753906]}}}}}
dosubot[bot] commented 4 weeks ago

The compile error in your OpenSearch query is likely due to the incorrect handling of the value attribute in the _parse_filter method. Specifically, the value attribute is accessed without being defined, which can cause a compile error.

Here is the problematic part of the code:

def _parse_filter(self, filter: MetadataFilter) -> dict:
    """Parse a single MetadataFilter to equivalent OpenSearch expression.

    As Opensearch does not differentiate between scalar/array keyword fields, IN and ANY are equivalent.
    """
    key = f"metadata.{filter.key}"
    op = filter.operator

    equality_postfix = ".keyword" if self._is_text_field(value=filter.value) else ""
    if op == FilterOperator.EQ:
        return {"term": {f"{key}{equality_postfix}": filter.value}}
    elif op in [
        FilterOperator.GT,
        FilterOperator.GTE,
        FilterOperator.LT,
        FilterOperator.LTE,
    ]:
        return {"range": {key: {filter.operator.name.lower(): filter.value}}}
    elif op == FilterOperator.NE:
        return {
            "bool": {
                "must_not": {"term": {f"{key}{equality_postfix}": filter.value}}
            }
        }
    elif op in [FilterOperator.IN, FilterOperator.ANY]:
        return {"terms": {key: filter.value}}
    elif op == FilterOperator.NIN:
        return {"bool": {"must_not": {"terms": {key: filter.value}}}}
    elif op == FilterOperator.ALL:
        return {
            "terms_set": {
                key: {
                    "terms": filter.value,
                    "minimum_should_match_script": {"source": "params.num_terms"},
                }
            }
        }
    elif op == FilterOperator.TEXT_MATCH:
        return {"match": {key: {"query": filter.value, "fuzziness": "AUTO"}}}
    elif op == FilterOperator.CONTAINS:
        return {"wildcard": {key: f"*{filter.value}*"}}
    else:
        raise ValueError(f"Unsupported filter operator: {filter.operator}")

The value attribute is accessed in the line:

equality_postfix = ".keyword" if self._is_text_field(value=filter.value) else ""

To fix this, ensure that the value attribute is correctly accessed from the filter object. The corrected code should look like this:

def _parse_filter(self, filter: MetadataFilter) -> dict:
    """Parse a single MetadataFilter to equivalent OpenSearch expression.

    As Opensearch does not differentiate between scalar/array keyword fields, IN and ANY are equivalent.
    """
    key = f"metadata.{filter.key}"
    op = filter.operator
    value = filter.value  # Ensure value is correctly accessed

    equality_postfix = ".keyword" if self._is_text_field(value=value) else ""
    if op == FilterOperator.EQ:
        return {"term": {f"{key}{equality_postfix}": value}}
    elif op in [
        FilterOperator.GT,
        FilterOperator.GTE,
        FilterOperator.LT,
        FilterOperator.LTE,
    ]:
        return {"range": {key: {filter.operator.name.lower(): value}}}
    elif op == FilterOperator.NE:
        return {
            "bool": {
                "must_not": {"term": {f"{key}{equality_postfix}": value}}
            }
        }
    elif op in [FilterOperator.IN, FilterOperator.ANY]:
        return {"terms": {key: value}}
    elif op == FilterOperator.NIN:
        return {"bool": {"must_not": {"terms": {key: value}}}}
    elif op == FilterOperator.ALL:
        return {
            "terms_set": {
                key: {
                    "terms": value,
                    "minimum_should_match_script": {"source": "params.num_terms"},
                }
            }
        }
    elif op == FilterOperator.TEXT_MATCH:
        return {"match": {key: {"query": value, "fuzziness": "AUTO"}}}
    elif op == FilterOperator.CONTAINS:
        return {"wildcard": {key: f"*{value}*"}}
    else:
        raise ValueError(f"Unsupported filter operator: {filter.operator}")

This ensures that the value attribute is correctly accessed and used in the filter parsing logic, which should resolve the compile error [1].

Additionally, ensure that the script_score function is correctly implemented and integrated with the OpenSearch client. The llama-index-legacy/llama_index/legacy/vector_stores/opensearch.py file from LlamaIndex provides a comprehensive set of functions and classes that can help verify the functionality of your OpenSearch integration. Make sure to review and utilize these functions, such as _default_painless_scripting_query and _knn_search_query, to confirm that your implementation works as expected [2].

To continue talking to Dosu, mention @dosu.

GeorgeDittmar commented 4 weeks ago

@dosu I made the changes suggested yet it still generates the following query for AWS Opensearch and throws a compile error like so.

backend-1  |   File "/usr/local/lib/python3.11/site-packages/opensearchpy/client/utils.py", line 176, in _wrapped
backend-1  |     return func(*args, params=params, headers=headers, **kwargs)
backend-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
backend-1  |   File "/usr/local/lib/python3.11/site-packages/opensearchpy/client/__init__.py", line 2364, in search
backend-1  |     return self.transport.perform_request(
backend-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
backend-1  |   File "/usr/local/lib/python3.11/site-packages/opensearchpy/transport.py", line 455, in perform_request
backend-1  |     raise e
backend-1  |   File "/usr/local/lib/python3.11/site-packages/opensearchpy/transport.py", line 416, in perform_request
backend-1  |     status, headers_response, data = connection.perform_request(
backend-1  |                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
backend-1  |   File "/usr/local/lib/python3.11/site-packages/opensearchpy/connection/http_requests.py", line 238, in perform_request
backend-1  |     self._raise_error(
backend-1  |   File "/usr/local/lib/python3.11/site-packages/opensearchpy/connection/base.py", line 315, in _raise_error
backend-1  |     raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
backend-1  | opensearchpy.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'compile error')

Here is the query that was generated by LLamaindex code

{'size': 6, 'query': {'script_score': {'query': {'bool': {'filter': [{'bool': {'must': [{'term': {'metadata.index_type.keyword': 'smart_chunk'}}]}}]}}, 'script': {'source': "1 / (1.0 + l2Squared(params.query_value, doc['embedding']))", 'params': {'field': 'embedding', 'query_value': [-0.11328125, -0.072753906, 0.36523438]}}}}}

Here is a schema of the index I am trying to run against. Is there something wrong with the setup of the index or could it be the generation of the query?

{'test-index': {'mappings': {'properties': {'content': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'embedding': {'type': 'knn_vector',
     'dimension': 1536,
     'method': {'engine': 'faiss',
      'space_type': 'l2',
      'name': 'hnsw',
      'parameters': {}}},
    'id': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'metadata': {'properties': {'_node_content': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
      '_node_type': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
      'doc_id': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
      'document_id': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
      'index_type': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
      'original_text': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
      'ref_doc_id': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
      'window': {'type': 'text',
       'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}}}}}}
dosubot[bot] commented 4 weeks ago

The compile error you're encountering is due to the l2Squared function not being supported in OpenSearch's Painless scripting language. This is a known issue that can cause compile errors when used in your script.

To resolve this, you need to use an alternative method for calculating the distance between vectors. One common approach is to manually implement the L2 distance calculation in the Painless script. Here is how you can modify your query to avoid using the l2Squared function:

{
  "size": 6,
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": [
            {
              "bool": {
                "must": [
                  {
                    "term": {
                      "metadata.index_type.keyword": "smart_chunk"
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "script": {
        "source": """
          double l2 = 0;
          for (int i = 0; i < params.query_value.length; ++i) {
            l2 += Math.pow(params.query_value[i] - doc[params.field][i], 2);
          }
          return 1 / (1.0 + l2);
        """,
        "params": {
          "field": "embedding",
          "query_value": [-0.11328125, -0.072753906, 0.36523438]
        }
      }
    }
  }
}

This script manually calculates the L2 distance by iterating over the elements of the query vector and the document vector, summing the squared differences, and then applying the same transformation as before.

Make sure to update your query generation code to use this new script. This should resolve the compile error and allow your query to execute correctly in AWS OpenSearch.

For reference, the issue with l2Squared not being supported in Painless scripting is documented in the LlamaIndex repository [1].

To continue talking to Dosu, mention @dosu.

GeorgeDittmar commented 4 weeks ago

@dosu According to Opensearch docs that is not correct.

dosubot[bot] commented 4 weeks ago

To resolve the compile error in your OpenSearch query when using the MetadataFilter object and the l2Squared function, ensure that the value attribute of the MetadataFilter object is correctly referenced in the _parse_filter method. The current implementation does not correctly reference filter.value when determining the equality_postfix.

Here is the corrected code snippet for the _parse_filter method:

def _parse_filter(self, filter: MetadataFilter) -> dict:
    """Parse a single MetadataFilter to equivalent OpenSearch expression.

    As Opensearch does not differentiate between scalar/array keyword fields, IN and ANY are equivalent.
    """
    key = f"metadata.{filter.key}"
    op = filter.operator
    value = filter.value  # Ensure the value is correctly referenced

    equality_postfix = ".keyword" if self._is_text_field(value=value) else ""
    if op == FilterOperator.EQ:
        return {"term": {f"{key}{equality_postfix}": value}}
    elif op in [
        FilterOperator.GT,
        FilterOperator.GTE,
        FilterOperator.LT,
        FilterOperator.LTE,
    ]:
        return {"range": {key: {filter.operator.name.lower(): value}}}
    elif op == FilterOperator.NE:
        return {
            "bool": {
                "must_not": {"term": {f"{key}{equality_postfix}": value}}
            }
        }
    elif op in [FilterOperator.IN, FilterOperator.ANY]:
        return {"terms": {key: value}}
    elif op == FilterOperator.NIN:
        return {"bool": {"must_not": {"terms": {key: value}}}}
    elif op == FilterOperator.ALL:
        return {
            "terms_set": {
                key: {
                    "terms": value,
                    "minimum_should_match_script": {"source": "params.num_terms"},
                }
            }
        }
    elif op == FilterOperator.TEXT_MATCH:
        return {"match": {key: {"query": value, "fuzziness": "AUTO"}}}
    elif op == FilterOperator.CONTAINS:
        return {"wildcard": {key: f"*{value}*"}}
    else:
        raise ValueError(f"Unsupported filter operator: {filter.operator}")

Additionally, ensure that your test setup and execution environment are correctly configured. For example, the test_opensearch_client.py file from the LlamaIndex repository provides a comprehensive setup for testing OpenSearch integration, including starting OpenSearch locally, defining sample vectors, and running various tests. This setup can help you verify that your _parse_filter method works correctly in different scenarios [1][2].

To continue talking to Dosu, mention @dosu.

GeorgeDittmar commented 4 weeks ago

@dosu I am already doing that and still am getting compile error from aws opensesarch serverless. We have come full circle. Can someone else chime in possibly? I have also rebuilt the index using llamaindex defaults and no luck. there was an update 2 weeks ago to try to handle filters better but I am curious how it was tested becasue it really does seem like the query being generated is wrong somehow. if I remove the function called in score_script and just use knn score it returns and doesnt throw an error so maybe there is some issue with the painless script? I've tried indexes using faiss and nmslib.

dosubot[bot] commented 4 weeks ago

This is a tough nut to crack, could really use your help on this, @logan-markewich.

To continue talking to Dosu, mention @dosu.

GeorgeDittmar commented 4 weeks ago

to add more context I am also trying the query generated by llamaindex code directly against opensearch-py standalone and getting these same errors

GeorgeDittmar commented 3 weeks ago

Ok figured out its due to opensearch serverless not supporting that function type. I will put in a PR to better support Opensearch Serverless since the default query thats generated seems to assume full query language support and uses l2squared.