Closed — mw19930312 closed this issue 3 months ago
Hey @mw19930312, great to see you back! Hope you're doing well. 😊
Thank you for providing the detailed bug report. It seems like the issue is related to the interaction between the `RetrieverQueryEngine` and the `Opensearch` component of the LlamaIndex framework. The error message `RequestError(400, 'x_content_parse_exception', '[level] query malformed, no start_object after query name')` suggests that the query being sent to Opensearch is not correctly formatted.
The error is raised from the `opensearchpy` library, which the LlamaIndex framework uses to interact with the Opensearch server. Specifically, the error occurs when the `search` method of the `opensearchpy` client is called; this method executes a search query against the Opensearch index. The `search` method expects a properly formatted query object, and the query object being passed is not correctly structured, hence the `x_content_parse_exception` error. This could be due to a variety of reasons, such as incorrect query syntax, missing required fields, or incompatible data types.
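For illustration, this particular parse message ("no start_object after query name") usually means a filter clause names a field directly where OpenSearch expects a query type such as `term`. A minimal sketch of the two shapes (illustrative only, not taken from the llama_index code):

```python
import json

# Rejected shape: "level" sits where OpenSearch expects a query type, so the
# parser looks for an object ("start_object") after the name "level" and
# instead finds the scalar 2.
malformed = {"query": {"bool": {"filter": [{"level": 2}]}}}

# Accepted shape: the field/value pair is wrapped in a "term" query object.
well_formed = {"query": {"bool": {"filter": [{"term": {"level": 2}}]}}}

print(json.dumps(well_formed["query"]["bool"]["filter"]))
```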
To resolve this issue, we need to identify the exact cause of the malformed query. Here are a few steps you can take:
1. **Check the Query Syntax**: Ensure that the query syntax is correct and compatible with the Opensearch query DSL. You can refer to the Opensearch Query DSL documentation for more details.
2. **Inspect the Query Object**: Add some debug logs in the `query` method of the `Opensearch` component in the LlamaIndex framework to print out the query object before it is sent to the `search` method of the `opensearchpy` client. This can help you identify any issues with the query structure.
3. **Update the `opensearchpy` Library**: If you're using an older version of the `opensearchpy` library, try updating it to the latest version. There might be compatibility issues between the version of Opensearch you're using and the `opensearchpy` library.
4. **Check the Opensearch Version**: Ensure that the version of Opensearch you're using is compatible with the LlamaIndex framework. If you're using a newer version of Opensearch, there might be breaking changes in the query DSL that are causing the issue.
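As a low-effort alternative to step 2, opensearch-py logs each request body on its `opensearch` and `opensearch.trace` loggers, so raising them to DEBUG prints the exact query JSON before the server rejects it (a troubleshooting sketch, assuming those standard logger names):

```python
import logging

# Enable wire-level logging so the exact query body sent to OpenSearch
# is visible in the application logs.
logging.basicConfig(level=logging.INFO)
for name in ("opensearch", "opensearch.trace"):
    logging.getLogger(name).setLevel(logging.DEBUG)

print(logging.getLogger("opensearch").getEffectiveLevel())  # 10 == DEBUG
```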
If you're still facing the issue after trying these steps, please provide more details about the query you're trying to execute and the version of Opensearch you're using. This will help in identifying the exact cause of the issue and finding a suitable solution.
@mw19930312 This seems like an issue with how opensearch is formatting metadata filters? I've never used open search in my life lol So if you had time to make a PR that would be wonderful :)
@logan-markewich Thanks for the reply! I took a careful look at the Opensearch nodes in my index. It seems that the leaf nodes provided to RAPTOR do not have a metadata field named "level"; only the summarization nodes have "level" as a metadata field. Is this expected behavior? If not, how should I revise it?
Here are two example nodes in my index.
Node 1 (leaf node):
```json
{
  "content": "We propose to develop a new retrieval process that retrieves nodes hierarchically. This will be an initial step of implementing RAPTOR or other types of graph retrieval engine (e.g., GraphRAG). With a hierarchical retrieval system, we will be able to leverage additional information of a document (e.g., summary, structure, metadata) besides text chunks so that the performance of our agent is improved.",
  "metadata": {
    "header": "TL;DR",
    "parent_id": "feaa2d47-e32f-486a-a38f-76c8ea1a0692",
    "_node_content": "{\"id_\": \"a70ca67c-f96f-4ec8-9357-021975f7cc5f\", \"embedding\": null, \"metadata\": {\"header\": \"TL;DR\", \"parent_id\": \"feaa2d47-e32f-486a-a38f-76c8ea1a0692\"}, \"excluded_embed_metadata_keys\": [\"parent_id\"], \"excluded_llm_metadata_keys\": [\"parent_id\"], \"relationships\": {\"1\": {\"node_id\": \"3e09fea1-9552-43b5-9d41-e08754a60e3e\", \"node_type\": \"4\", \"metadata\": {\"header\": \"TL;DR\"}, \"hash\": \"db476100ee85889d9112e6c2c05ac4e8ccef13476f32a40ee2a98111d4e1ad6e\", \"class_name\": \"RelatedNodeInfo\"}, \"3\": {\"node_id\": \"57e04790-47e0-426a-ab32-2863a43d1468\", \"node_type\": \"1\", \"metadata\": {}, \"hash\": \"164443d022a50247644758a22c4a762c27d154617362c3994a293c4257effb17\", \"class_name\": \"RelatedNodeInfo\"}}, \"text\": \"\", \"start_char_idx\": 0, \"end_char_idx\": 403, \"text_template\": \"{metadata_str}\\n\\n{content}\", \"metadata_template\": \"{key}: {value}\", \"metadata_seperator\": \"\\n\", \"class_name\": \"TextNode\"}",
    "_node_type": "TextNode",
    "document_id": "3e09fea1-9552-43b5-9d41-e08754a60e3e",
    "doc_id": "3e09fea1-9552-43b5-9d41-e08754a60e3e",
    "ref_doc_id": "3e09fea1-9552-43b5-9d41-e08754a60e3e"
  }
}
```
Node 2 (summary node):
```json
{
  "content": "A new hierarchical retrieval process is proposed to enhance the performance of a security agent by leveraging additional information like summaries, structures, and metadata during document retrieval. The urgency of implementing this before releasing the agent to customers is emphasized to improve user experience, particularly in support and analytics scenarios. The decision may increase precision but could also raise latency and incur a cost of approximately 3-4 person-weeks for potential infrastructure changes. The current issue with the agent failing to answer questions correctly due to retrieval limitations is highlighted, prompting the need for a more sophisticated retrieval system. The proposed solution involves creating hierarchical information during ingestion and retrieving related nodes using breadth-first-search to maintain structural integrity. The benefits include improved agent performance, reusability for building more complex retrieval systems, and addressing existing pain points in support/analytics use cases. However, risks such as increased latency and potential compatibility issues with Opensearch are acknowledged, with mitigations in place to optimize algorithms and develop custom libraries if needed.",
  "metadata": {
    "level": 0,
    "parent_id": "e198a77b-b77c-4cd9-8977-e134ab0df065",
    "_node_content": "{\"id_\": \"feaa2d47-e32f-486a-a38f-76c8ea1a0692\", \"embedding\": null, \"metadata\": {\"level\": 0, \"parent_id\": \"e198a77b-b77c-4cd9-8977-e134ab0df065\"}, \"excluded_embed_metadata_keys\": [\"level\", \"parent_id\"], \"excluded_llm_metadata_keys\": [\"level\", \"parent_id\"], \"relationships\": {}, \"text\": \"\", \"start_char_idx\": null, \"end_char_idx\": null, \"text_template\": \"{metadata_str}\\n\\n{content}\", \"metadata_template\": \"{key}: {value}\", \"metadata_seperator\": \"\\n\", \"class_name\": \"TextNode\"}",
    "_node_type": "TextNode",
    "document_id": "None",
    "doc_id": "None",
    "ref_doc_id": "None"
  }
}
```
I took a further look at the code. It seems that I need to revise the query to the following, with `level` changed to `metadata.level` and an additional layer of `term` wrapped around it. What would be the best approach moving forward? Should I submit a PR to llama_index for review?
```python
temp_query = {
    "size": 1,
    "query": {
        "script_score": {
            "query": {"bool": {"filter": [{"term": {"metadata.level": 2}}]}},
            "script": {
                "source": "1/(1.0 + l2Squared(params.query_value, doc['embedding']))",
                "params": {
                    "field": "embedding",
                    "query_value": [blablabla],
                },
            },
        },
    },
}
```
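The fix described above — prefixing each metadata key with `metadata.` and wrapping every field/value pair in a `term` clause — can be sketched as a small helper (names are illustrative, not the actual llama_index code):

```python
def build_metadata_filters(filters: dict) -> list:
    """Turn {"level": 2}-style filters into OpenSearch term clauses
    under the "metadata." prefix (hypothetical helper, for illustration)."""
    return [{"term": {f"metadata.{key}": value}} for key, value in filters.items()]

print(build_metadata_filters({"level": 2}))  # [{'term': {'metadata.level': 2}}]
```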
So, leaf nodes shouldn't have a level, since they are children of a parent summary. Summaries represent the entire cluster at a particular level, and then we can retrieve all nodes from that level using the ID of the parent summary.
@mw19930312 If that is a fix for Opensearch metadata filters in general, I'm happy to have a PR for it. It seems like the metadata filter construction for Opensearch may be incorrect.
Bug Description
I'm following the RAPTOR notebook to conduct a few experiments using Opensearch. However, the `RetrieverQueryEngine` fails to generate a correct answer due to a failure in `opensearchpy`.
Version
0.10.16
Steps to Reproduce
```python
from llama_index.packs.raptor import RaptorPack
from llama_index.core.query_engine import RetrieverQueryEngine

llama_index_documents = convert_to_llama_index_document(parsed_google_docs)

raptor_pack_google_doc = RaptorPack(
    llama_index_documents,
    embed_model=OpenAIEmbedding(model="text-embedding-ada-002"),  # used for embedding clusters
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0),  # used for generating summaries
    vector_store=vector_store,  # used for storage
    similarity_top_k=5,  # top k for each layer, or overall top-k for collapsed
    mode="tree_traversal",  # sets default mode
    transformations=[
        SentenceSplitter(chunk_size=400, chunk_overlap=50)
    ],  # transformations applied for ingestion
)

query_engine = RetrieverQueryEngine.from_args(
    raptor_pack_google_doc.retriever,
    llm=OpenAI(model="gpt-4-1106-preview", temperature=0),
)
response = query_engine.query("Hello")
print(str(response))
```
Relevant Logs/Tracebacks