Closed Blandrust closed 2 months ago
Hey there, @Blandrust! I'm here to help you with any bugs, questions, or contributions. Let's tackle this issue together while we wait for a human maintainer.
Yes, there is a way to modify how LlamaIndex constructs the filter query for OpenSearch. The _parse_filter and _parse_filters_recursively methods in the OpensearchVectorClient class are responsible for converting MetadataFilters into OpenSearch queries. You can modify these methods to change how the filter queries are constructed.
Here is the relevant code snippet from the OpensearchVectorClient class:
def _parse_filter(self, filter: MetadataFilter) -> dict:
    """Parse a single MetadataFilter to equivalent OpenSearch expression.

    As Opensearch does not differentiate between scalar/array keyword fields, IN and ANY are equivalent.
    """
    key = f"metadata.{filter.key}"
    op = filter.operator
    equality_postfix = ".keyword" if self._is_text_field(value=filter.value) else ""
    if op == FilterOperator.EQ:
        return {"term": {f"{key}{equality_postfix}": filter.value}}
    elif op in [
        FilterOperator.GT,
        FilterOperator.GTE,
        FilterOperator.LT,
        FilterOperator.LTE,
    ]:
        return {"range": {key: {filter.operator.name.lower(): filter.value}}}
    elif op == FilterOperator.NE:
        return {
            "bool": {
                "must_not": {"term": {f"{key}{equality_postfix}": filter.value}}
            }
        }
    elif op in [FilterOperator.IN, FilterOperator.ANY]:
        return {"terms": {key: filter.value}}
    elif op == FilterOperator.NIN:
        return {"bool": {"must_not": {"terms": {key: filter.value}}}}
    elif op == FilterOperator.ALL:
        return {
            "terms_set": {
                key: {
                    "terms": filter.value,
                    "minimum_should_match_script": {"source": "params.num_terms"},
                }
            }
        }
    elif op == FilterOperator.TEXT_MATCH:
        return {"match": {key: {"query": filter.value, "fuzziness": "AUTO"}}}
    elif op == FilterOperator.CONTAINS:
        return {"wildcard": {key: f"*{filter.value}*"}}
    else:
        raise ValueError(f"Unsupported filter operator: {filter.operator}")

def _parse_filters_recursively(self, filters: MetadataFilters) -> dict:
    """Parse (possibly nested) MetadataFilters to equivalent OpenSearch expression."""
    condition_map = {FilterCondition.AND: "must", FilterCondition.OR: "should"}
    bool_clause = condition_map[filters.condition]
    bool_query: dict[str, dict[str, list[dict]]] = {"bool": {bool_clause: []}}
    for filter in filters.filters:
        if isinstance(filter, MetadataFilters):
            bool_query["bool"][bool_clause].append(self._parse_filters_recursively(filter))
        else:
            bool_query["bool"][bool_clause].append(self._parse_filter(filter))
    return bool_query
You can customize these methods to fit your specific requirements for constructing filter queries in OpenSearch.
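To see the shape of the query these two methods produce without a running cluster, here is a minimal stand-alone sketch that mirrors the mapping above for a handful of operators. The Filter and Filters dataclasses and the to_opensearch function are hypothetical stand-ins, not LlamaIndex classes, and the ".keyword" suffix rule is simplified to "any string value", which is an assumption about what _is_text_field does.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class Filter:
    """Hypothetical stand-in for MetadataFilter."""
    key: str
    value: Any
    operator: str = "=="


@dataclass
class Filters:
    """Hypothetical stand-in for MetadataFilters."""
    filters: list
    condition: str = "and"


# Comparison operators and the OpenSearch range keywords they map to.
RANGE_OPS = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}


def to_opensearch(node: Union[Filter, Filters]) -> dict:
    """Recursively convert a (possibly nested) filter tree to a query dict."""
    if isinstance(node, Filters):
        clause = {"and": "must", "or": "should"}[node.condition]
        return {"bool": {clause: [to_opensearch(f) for f in node.filters]}}
    key = f"metadata.{node.key}"
    if node.operator == "==":
        # Assumption: string values are matched on the ".keyword" sub-field.
        suffix = ".keyword" if isinstance(node.value, str) else ""
        return {"term": {f"{key}{suffix}": node.value}}
    if node.operator in RANGE_OPS:
        return {"range": {key: {RANGE_OPS[node.operator]: node.value}}}
    if node.operator == "in":
        return {"terms": {key: node.value}}
    raise ValueError(f"Unsupported filter operator: {node.operator}")


query = to_opensearch(Filters([Filter("category", "y"), Filter("year", 2000, ">=")]))
print(query)
```

Each leaf becomes a clause whose outer key is an OpenSearch query type (term, range, terms), and each group becomes a bool clause, so nesting groups nests bool queries.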
Additionally, the LlamaIndex documentation provides examples of how to use the ElasticsearchStore class and apply metadata filters. For instance, you can use the ExactMatchFilter and MetadataFilters classes to filter documents based on metadata:
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
from llama_index.vector_stores.elasticsearch import ElasticsearchStore

# `movies` and `print_results` are assumed to be defined as in the documentation example.
metadata_store = ElasticsearchStore(
    es_url="http://localhost:9200",  # for Elastic Cloud authentication see above
    index_name="movies_metadata",
)
storage_context = StorageContext.from_defaults(vector_store=metadata_store)
index = VectorStoreIndex(movies, storage_context=storage_context)

# Metadata filter
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="theme", value="Mafia")]
)
retriever = index.as_retriever(filters=filters)
results = retriever.retrieve("What is inception about?")
print_results(results)
For more advanced custom filters, you can override the query by passing a custom query function:
def custom_query(query, query_str):
    print("custom query", query)
    return query

query_engine = index.as_query_engine(
    vector_store_kwargs={
        "es_filter": [{"match": {"title": "matrix"}}],
        "custom_query": custom_query,
    }
)
query_engine.query("what is this movie about?")
This allows you to leverage the full range of filtering capabilities provided by Elasticsearch.
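As a rough illustration of what a custom query function can do beyond printing, the sketch below returns a copy of the query body with one extra term filter appended. The layout of sample_body (a "knn" object holding a "filter" list) and the field name metadata.theme.keyword are assumptions made for this example, not a guaranteed description of what the store sends, so print the real body first and adapt the keys.

```python
import copy


def custom_query(query_body: dict, query_str: str) -> dict:
    """Return a copy of the query body with one extra term filter appended."""
    patched = copy.deepcopy(query_body)  # leave the caller's dict untouched
    knn = patched.setdefault("knn", {})
    # Assumed field name; adjust to the mapping of your own index.
    knn.setdefault("filter", []).append({"term": {"metadata.theme.keyword": "Mafia"}})
    return patched


# Illustrative body only; the real one is built by the vector store.
sample_body = {
    "knn": {
        "field": "embedding",
        "k": 2,
        "filter": [{"match": {"title": "matrix"}}],
    }
}
patched = custom_query(sample_body, "what is this movie about?")
print(patched["knn"]["filter"])
```

Working on a deep copy keeps the function free of side effects, which makes it easier to compare the original and patched bodies while debugging.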
Moreover, the test file test_opensearch_client.py in the LlamaIndex repository provides practical examples of how to use these filters in various scenarios, including nested filters and different filter operators. This can serve as a useful reference for understanding and customizing the filter query construction process [1][2][3].
@Blandrust There was a bug in how llama_index handled metadata filtering with OpenSearch. I believe it was fixed last week, so your problem should be solved if you simply upgrade to the latest llama_index version. If you don't want to upgrade right now, you will need to patch the _parse_filters method of the OpensearchVectorClient class, which is where the problem is. The filter should look like:
'filter': [{'term': {'metadata.category': 'y'}}]
It needs the term key wrapped around the metadata field, with metadata. added as a prefix to the filter key.
import json
from typing import Any, Optional

from llama_index.core.vector_stores import MetadataFilters
from llama_index.vector_stores.opensearch import OpensearchVectorStore, OpensearchVectorClient

# Replacement for the default _parse_filters method
def new_parse_filters(self, filters: Optional[MetadataFilters]) -> Any:
    pre_filter = []
    if filters is not None:
        for f in filters.legacy_filters():
            if isinstance(f.value, list):
                # A list of values maps to a "terms" clause
                pre_filter.append({"terms": {f.key: f.value}})
            elif isinstance(f.value, str):
                # A single string maps to a "term" clause
                pre_filter.append({"term": {f.key: f.value}})
            else:
                pre_filter.append({f.key: json.loads(str(f.value))})
    return pre_filter

# Monkey-patch the client class so it uses the replacement method
OpensearchVectorClient._parse_filters = new_parse_filters
Adding this replaces the _parse_filters method and formats the filters properly for now. Remember that before the recent fix, the key in MetadataFilter had to be metadata. followed by the field name, so metadata.category in your case. You don't have to do this if you just upgrade the llama_index version.
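For reference, the clause shape described above can be checked in plain Python, without LlamaIndex or a cluster. This is only a sketch: prefixed_key and term_clause are hypothetical helper names, not part of any library.

```python
from typing import Any


def prefixed_key(key: str, prefix: str = "metadata.") -> str:
    """Add the metadata prefix unless the key already carries it."""
    return key if key.startswith(prefix) else f"{prefix}{key}"


def term_clause(key: str, value: Any) -> dict:
    """Wrap a key/value pair in a "term" (scalar) or "terms" (list) clause."""
    query_type = "terms" if isinstance(value, list) else "term"
    return {query_type: {prefixed_key(key): value}}


clause = term_clause("category", "y")
print({"filter": [clause]})
```

The same helper returns an identical clause for "category" and "metadata.category", so it does not matter whether the prefix was already added by hand.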
Note: the answer given by the dosu bot is based on the latest version, where metadata filtering has been fixed, so it might not be relevant to your question about a previous version. Upgrading would be the best way to solve your problem.
This is solved if you update: pip install -U llama-index-vector-stores-opensearch
Question
I'm using LlamaIndex as a knowledge base with OpenSearch as the vector store. Everything works fine until I try to add MetadataFilters to my query. Here's my current query function:
Without filters, the function retrieves documents as expected. However, when I add the filters, I get this error:
RequestError: RequestError(400, 'x_content_parse_exception', '[category] query malformed, no start_object after query name')
I logged the query sent to OpenSearch, and it looks something like this:
I believe the filter should look more like this:
Is there a way to modify how LlamaIndex constructs the filter query for OpenSearch? Or am I misunderstanding how to use MetadataFilters with OpenSearch? Any help or guidance would be greatly appreciated. I'm not sure if this is a LlamaIndex issue or if I'm missing something in my OpenSearch configuration. Thank you in advance for your assistance!