run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: RAPTOR query engine not working with Opensearch #11718

Closed mw19930312 closed 3 months ago

mw19930312 commented 6 months ago

Bug Description

I'm following the RAPTOR notebook to conduct a few experiments using Opensearch. However, the RetrieverQueryEngine fails to generate a correct answer due to a failure in opensearchpy.

Version

0.10.16

Steps to Reproduce

llama_index_documents = convert_to_llama_index_document(parsed_google_docs)

from llama_index.packs.raptor import RaptorPack

raptor_pack_google_doc = RaptorPack(
    llama_index_documents,
    embed_model=OpenAIEmbedding(model="text-embedding-ada-002"),  # used for embedding clusters
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0),  # used for generating summaries
    vector_store=vector_store,  # used for storage
    similarity_top_k=5,  # top k for each layer, or overall top-k for collapsed
    mode="tree_traversal",  # sets default mode
    transformations=[SentenceSplitter(chunk_size=400, chunk_overlap=50)],  # transformations applied for ingestion
)

from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    raptor_pack_google_doc.retriever,
    llm=OpenAI(model="gpt-4-1106-preview", temperature=0),
)
response = query_engine.query("Hello")
print(str(response))

Relevant Logs/Tracebacks

tests/performance_test/test_raptor.py:86: in <module>
    response = query_engine.query("Hello")
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/core/base/base_query_engine.py:40: in query
    return self._query(str_or_query_bundle)
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/core/query_engine/retriever_query_engine.py:186: in _query
    nodes = self.retrieve(query_bundle)
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/core/query_engine/retriever_query_engine.py:142: in retrieve
    nodes = self._retriever.retrieve(query_bundle)
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/packs/raptor/base.py:289: in retrieve
    return asyncio.run(self.aretrieve(query_str, mode or self.mode))
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/nest_asyncio.py:30: in run
    return loop.run_until_complete(task)
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/nest_asyncio.py:98: in run_until_complete
    return f.result()
/opt/homebrew/Cellar/python@3.11/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/futures.py:203: in result
    raise self._exception.with_traceback(self._exception_tb)
/opt/homebrew/Cellar/python@3.11/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/tasks.py:277: in __step
    result = coro.send(None)
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/packs/raptor/base.py:302: in aretrieve
    return await self.tree_traversal_retrieval(query_str)
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/packs/raptor/base.py:242: in tree_traversal_retrieval
    nodes = await self.index.as_retriever(
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/core/base/base_retriever.py:249: in aretrieve
    nodes = await self._aretrieve(query_bundle)
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/core/indices/vector_store/retrievers/retriever.py:105: in _aretrieve
    return await self._aget_nodes_with_embeddings(query_bundle)
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/core/indices/vector_store/retrievers/retriever.py:177: in _aget_nodes_with_embeddings
    query_result = await self._vector_store.aquery(query, **self._kwargs)
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/core/vector_stores/types.py:365: in aquery
    return self.query(query, **kwargs)
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/vector_stores/opensearch/base.py:480: in query
    return self._client.query(
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/vector_stores/opensearch/base.py:383: in query
    res = self._os_client.search(
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/opensearchpy/client/utils.py:181: in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/opensearchpy/client/__init__.py:1742: in search
    return self.transport.perform_request(
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/opensearchpy/transport.py:448: in perform_request
    raise e
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/opensearchpy/transport.py:409: in perform_request
    status, headers_response, data = connection.perform_request(
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/opensearchpy/connection/http_requests.py:232: in perform_request
    self._raise_error(
../../../Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/opensearchpy/connection/base.py:316: in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
E   opensearchpy.exceptions.RequestError: RequestError(400, 'x_content_parse_exception', '[level] query malformed, no start_object after query name')
dosubot[bot] commented 6 months ago

Hey @mw19930312, great to see you back! Hope you're doing well. 😊

Thank you for providing the detailed bug report. It seems like the issue is related to the interaction between the RetrieverQueryEngine and the Opensearch component of the LlamaIndex framework. The error message RequestError(400, 'x_content_parse_exception', '[level] query malformed, no start_object after query name') suggests that the query being sent to Opensearch is not correctly formatted.

The error is being raised from the opensearchpy library, which is used by the LlamaIndex framework to interact with the Opensearch server. Specifically, the error occurs when the search method of the opensearchpy client is called. This method is used to execute a search query against the Opensearch index.

The search method expects a properly formatted query object, and it seems like the query object being passed is not correctly structured, hence the x_content_parse_exception error. This could be due to a variety of reasons, such as incorrect query syntax, missing required fields, or incompatible data types.
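For illustration, this particular parse error generally means OpenSearch encountered a query name that was not followed by an object. The two filter clauses below are illustrative only (not taken from the failing request); the first shape triggers the error, the second is well-formed:

# Illustrative only: a clause shaped like this produces
# "no start_object after query name", because OpenSearch reads "level"
# as a query name and then finds a scalar instead of an object.
bad_filter = {"bool": {"filter": [{"level": 0}]}}

# Well-formed: the query name ("term") is followed by an object.
good_filter = {"bool": {"filter": [{"term": {"level": 0}}]}}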

To resolve this issue, we need to identify the exact cause of the malformed query. Here are a few steps you can take:

  1. Check the Query Syntax: Ensure that the query syntax is correct and compatible with the Opensearch query DSL. You can refer to the Opensearch Query DSL documentation for more details.

  2. Inspect the Query Object: You can add some debug logs in the query method of the Opensearch component in the LlamaIndex framework to print out the query object before it's sent to the search method of the opensearchpy client (see the sketch after this list). This can help you identify any issues with the query structure.

  3. Update the opensearchpy Library: If you're using an older version of the opensearchpy library, try updating it to the latest version. There might be compatibility issues between the version of Opensearch you're using and the opensearchpy library.

  4. Check the Opensearch Version: Ensure that the version of Opensearch you're using is compatible with the LlamaIndex framework. If you're using a newer version of Opensearch, there might be breaking changes in the query DSL that are causing the issue.
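
As a concrete starting point for step 2, one way to see the exact request body is to wrap the low-level client's search method. The private attribute names below follow the traceback frames (vector store -> _client -> _os_client) and should be treated as assumptions:

import functools
import json

# Per the traceback, OpensearchVectorStore delegates to self._client
# (an OpensearchVectorClient), which calls self._os_client.search(...);
# reach through those attributes to wrap the low-level client.
os_client = vector_store._client._os_client
original_search = os_client.search

@functools.wraps(original_search)
def logged_search(*args, **kwargs):
    # Print the body before it is sent, so any malformed filter clause
    # is visible alongside the 400 response.
    print(json.dumps(kwargs.get("body"), indent=2, default=str))
    return original_search(*args, **kwargs)

os_client.search = logged_search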

If you're still facing the issue after trying these steps, please provide more details about the query you're trying to execute and the version of Opensearch you're using. This will help in identifying the exact cause of the issue and finding a suitable solution.

logan-markewich commented 6 months ago

@mw19930312 This seems like an issue with how opensearch is formatting metadata filters? I've never used open search in my life lol So if you had time to make a PR that would be wonderful :)

logan-markewich commented 6 months ago

The filters get parsed here https://github.com/run-llama/llama_index/blob/3ec5f5d62803a710bcccf4d679cdfb6ec51ae78e/llama-index-integrations/vector_stores/llama-index-vector-stores-opensearch/llama_index/vector_stores/opensearch/base.py#L165

mw19930312 commented 6 months ago

@logan-markewich Thanks for the reply! I took a careful look at the opensearch nodes from my index. It seems that the leaf nodes provided to raptor do not have a metadata field named "level". Only the summarization nodes have "level" as a metadata field. Is this expected behavior? If not, how should I revise it?

Here are two example nodes in my index.

Node 1 (leaf node):

{
    "content": "We propose to develop a new retrieval process that retrieves nodes hierarchically. This will be an initial step of implementing RAPTOR or other types of graph retrieval engine (e.g., GraphRAG). With a hierarchical retrieval system, we will be able to leverage additional information of a document (e.g., summary, structure, metadata) besides text chunks so that the performance of our agent is improved.",
    "metadata": {
        "header": "TL;DR",
        "parent_id": "feaa2d47-e32f-486a-a38f-76c8ea1a0692",
        "_node_content": "{\"id_\": \"a70ca67c-f96f-4ec8-9357-021975f7cc5f\", \"embedding\": null, \"metadata\": {\"header\": \"TL;DR\", \"parent_id\": \"feaa2d47-e32f-486a-a38f-76c8ea1a0692\"}, \"excluded_embed_metadata_keys\": [\"parent_id\"], \"excluded_llm_metadata_keys\": [\"parent_id\"], \"relationships\": {\"1\": {\"node_id\": \"3e09fea1-9552-43b5-9d41-e08754a60e3e\", \"node_type\": \"4\", \"metadata\": {\"header\": \"TL;DR\"}, \"hash\": \"db476100ee85889d9112e6c2c05ac4e8ccef13476f32a40ee2a98111d4e1ad6e\", \"class_name\": \"RelatedNodeInfo\"}, \"3\": {\"node_id\": \"57e04790-47e0-426a-ab32-2863a43d1468\", \"node_type\": \"1\", \"metadata\": {}, \"hash\": \"164443d022a50247644758a22c4a762c27d154617362c3994a293c4257effb17\", \"class_name\": \"RelatedNodeInfo\"}}, \"text\": \"\", \"start_char_idx\": 0, \"end_char_idx\": 403, \"text_template\": \"{metadata_str}\\n\\n{content}\", \"metadata_template\": \"{key}: {value}\", \"metadata_seperator\": \"\\n\", \"class_name\": \"TextNode\"}",
        "_node_type": "TextNode",
        "document_id": "3e09fea1-9552-43b5-9d41-e08754a60e3e",
        "doc_id": "3e09fea1-9552-43b5-9d41-e08754a60e3e",
        "ref_doc_id": "3e09fea1-9552-43b5-9d41-e08754a60e3e"
    }
}

Node 2 (summary node):

{
    "content": "A new hierarchical retrieval process is proposed to enhance the performance of a security agent by leveraging additional information like summaries, structures, and metadata during document retrieval. The urgency of implementing this before releasing the agent to customers is emphasized to improve user experience, particularly in support and analytics scenarios. The decision may increase precision but could also raise latency and incur a cost of approximately 3-4 person-weeks for potential infrastructure changes. The current issue with the agent failing to answer questions correctly due to retrieval limitations is highlighted, prompting the need for a more sophisticated retrieval system. The proposed solution involves creating hierarchical information during ingestion and retrieving related nodes using breadth-first-search to maintain structural integrity. The benefits include improved agent performance, reusability for building more complex retrieval systems, and addressing existing pain points in support/analytics use cases. However, risks such as increased latency and potential compatibility issues with Opensearch are acknowledged, with mitigations in place to optimize algorithms and develop custom libraries if needed.",
    "metadata": {
        "level": 0,
        "parent_id": "e198a77b-b77c-4cd9-8977-e134ab0df065",
        "_node_content": "{\"id_\": \"feaa2d47-e32f-486a-a38f-76c8ea1a0692\", \"embedding\": null, \"metadata\": {\"level\": 0, \"parent_id\": \"e198a77b-b77c-4cd9-8977-e134ab0df065\"}, \"excluded_embed_metadata_keys\": [\"level\", \"parent_id\"], \"excluded_llm_metadata_keys\": [\"level\", \"parent_id\"], \"relationships\": {}, \"text\": \"\", \"start_char_idx\": null, \"end_char_idx\": null, \"text_template\": \"{metadata_str}\\n\\n{content}\", \"metadata_template\": \"{key}: {value}\", \"metadata_seperator\": \"\\n\", \"class_name\": \"TextNode\"}",
        "_node_type": "TextNode",
        "document_id": "None",
        "doc_id": "None",
        "ref_doc_id": "None"
    }
}
mw19930312 commented 6 months ago

I took a further look at the code. It seems that I need to revise the query to the following, with level changed to metadata.level and an additional layer of term added. What would be the best approach moving forward? Should I submit a PR to llama_index for review?

temp_query = {
    "size": 1,
    "query": {
        "script_score": {
            "query": {"bool": {"filter": [{"term": {"metadata.level": 2}}]}},
            "script": {
                "source": "1/(1.0 + l2Squared(params.query_value, doc['embedding']))",
                "params": {
                    "field": "embedding",
                    "query_value": [blablabla],
                },
            },
        }
    },
}
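
For what it's worth, a corrected body like this can be sanity-checked against the index directly with the low-level opensearchpy client, bypassing llama_index entirely (host, port, and index name below are placeholders):

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Send the hand-built body straight to the search endpoint; if the
# filter clause is well-formed this returns hits instead of a 400
# x_content_parse_exception. Note query_value ([blablabla] above)
# must be replaced with a real embedding vector first.
res = client.search(index="my-raptor-index", body=temp_query)
print(res["hits"]["hits"])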
logan-markewich commented 6 months ago

So, leaf nodes shouldn't have a level, since they are children of a parent summary

Summaries represent the entire cluster at a particular level, and then we can retrieve all nodes from that particular level using the ID of the parent summary
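
In other words, retrieval only needs a level filter at the very top of the tree; everything below is reached by parent_id. A rough sketch of that flow (not the actual RaptorRetriever code; index here stands for any vector store index that supports metadata filtering):

from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

async def traverse(index, query_str, top_k, tree_depth):
    # Top of the tree: summary nodes are selected by their `level` metadata.
    nodes = await index.as_retriever(
        similarity_top_k=top_k,
        filters=MetadataFilters(
            filters=[MetadataFilter(key="level", value=tree_depth - 1)]
        ),
    ).aretrieve(query_str)

    # Every level below: children are selected by `parent_id`, which is
    # why leaf nodes carry `parent_id` but no `level` field.
    for _ in range(tree_depth - 1):
        children = []
        for parent_id in {n.id_ for n in nodes}:
            children += await index.as_retriever(
                similarity_top_k=top_k,
                filters=MetadataFilters(
                    filters=[MetadataFilter(key="parent_id", value=parent_id)]
                ),
            ).aretrieve(query_str)
        nodes = children
    return nodes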

logan-markewich commented 6 months ago

@mw19930312 if that is a fix for opensearch metadata filters in general, happy to have a PR to fix that. Seems like maybe the metadata filter construction for opensearch is incorrect
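
For anyone picking this up, a minimal sketch of the kind of change such a PR might make, following the corrected query above (wrap each filter in a term clause and prefix the key with metadata.). The function name matches the linked base.py, but the body here is an assumption, not the shipped fix:

from typing import Any, List, Optional

from llama_index.core.vector_stores.types import MetadataFilters

def _parse_filters(filters: Optional[MetadataFilters]) -> List[Any]:
    """Build OpenSearch pre-filter clauses from LlamaIndex metadata filters."""
    if filters is None:
        return []
    return [
        # `term` gives OpenSearch the object it expects after the query
        # name, and the `metadata.` prefix points at where node metadata
        # lives in the index.
        {"term": {f"metadata.{f.key}": f.value}}
        for f in filters.filters
    ]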