run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: list index out of range when asking a question with query_engine.query #12017

Closed MRX2005nzr closed 2 months ago

MRX2005nzr commented 6 months ago

Question Validation

Question

I have built a KnowledgeGraph RAG with my own data, and now I want to ask it some questions. But something goes wrong when I run my code, and the bug disappears when I reduce the length of my question. I don't know what's happening. Is my question too long?

System: macOS 13.6.1. Editor: Jupyter Notebook. Requirements: llama_index==0.10.13

The whole code is shown below:

from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

# For Azure OpenAI
api_key = "###"
azure_endpoint = "###"
api_version = "2023-05-15"

llm = AzureOpenAI(
    model="gpt-35-turbo",
    deployment_name="Test-trans-01",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)

# You need to deploy your own embedding model as well as your own chat completion model
embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="embedding-for-memory",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)

from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512

import os
os.environ["NEBULA_USER"] = "root"
os.environ["GRAPHD_HOST"] = "127.0.0.1"
os.environ["NEBULA_PASSWORD"] = "nebula" 
os.environ["NEBULA_ADDRESS"] = "127.0.0.1:9669" 

%reload_ext ngql
connection_string = f"--address {os.environ['GRAPHD_HOST']} --port 9669 --user root --password {os.environ['NEBULA_PASSWORD']}"
%ngql {connection_string}
# %ngql CREATE SPACE IF NOT EXISTS llamaindex(vid_type=FIXED_STRING(256), partition_num=1, replica_factor=1);
%ngql CREATE SPACE IF NOT EXISTS data_test_space(vid_type=FIXED_STRING(256), partition_num=1, replica_factor=1);

%ngql SHOW SPACES;

%%ngql
#USE llamaindex;
USE data_test_space;
CREATE TAG IF NOT EXISTS entity(name string);
CREATE EDGE IF NOT EXISTS relationship(relationship string);

%ngql CREATE TAG INDEX IF NOT EXISTS entity_index ON entity(name(256));

#space_name = "llamaindex"
space_name = "data_test_space"
edge_types, rel_prop_names = ["relationship"], [
    "relationship"
]  # defaults; can be omitted when creating from an empty KG
tags = ["entity"]  # default; can be omitted when creating from an empty KG

from llama_index.core import StorageContext
from llama_index.graph_stores.nebula import NebulaGraphStore

graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

from llama_index.core import download_loader
from llama_index.core import SimpleDirectoryReader
# from llama_index.legacy.readers.file.base import SimpleDirectoryReader
from llama_index.readers.wikipedia import WikipediaReader

loader = SimpleDirectoryReader(
    #"/Users/nzr/Desktop/Python/stip/GraphRAG/data_test.txt"
    input_files=["/Users/nzr/Desktop/Coding合集/Python/stip/RAG所用数据/nzr_data.txt"]
)
documents = loader.load_data()

from llama_index.core import KnowledgeGraphIndex

kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=10,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
)

%ngql MATCH ()-[e]->() RETURN e LIMIT 100
%ng_draw

from IPython.display import Markdown
query_engine = kg_index.as_query_engine()
response = query_engine.query("现在用户输入了一个记录着车流信息的xml文件,请写出MySumo进行车流建模分析的工作流程")
display(Markdown(f"<b>{response}</b>"))

The whole traceback is shown below:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[13], line 3
      1 from IPython.display import Markdown
      2 query_engine = kg_index.as_query_engine()
----> 3 response = query_engine.query("现在用户输入了一个记录着车流信息的xml文件,请写出MySumo进行车流建模分析的工作流程")
      4 display(Markdown(f"<b>{response}</b>"))

File /opt/homebrew/Caskroom/miniforge/base/envs/pytorch_env/lib/python3.10/site-packages/llama_index/core/base/base_query_engine.py:40, in BaseQueryEngine.query(self, str_or_query_bundle)
     38 if isinstance(str_or_query_bundle, str):
     39     str_or_query_bundle = QueryBundle(str_or_query_bundle)
---> 40 return self._query(str_or_query_bundle)

File /opt/homebrew/Caskroom/miniforge/base/envs/pytorch_env/lib/python3.10/site-packages/llama_index/core/query_engine/retriever_query_engine.py:186, in RetrieverQueryEngine._query(self, query_bundle)
    182 """Answer a query."""
    183 with self.callback_manager.event(
    184     CBEventType.QUERY, payload={EventPayload.QUERY_STR: query_bundle.query_str}
    185 ) as query_event:
--> 186     nodes = self.retrieve(query_bundle)
    187     response = self._response_synthesizer.synthesize(
    188         query=query_bundle,
    189         nodes=nodes,
    190     )
    192     query_event.on_end(payload={EventPayload.RESPONSE: response})

File /opt/homebrew/Caskroom/miniforge/base/envs/pytorch_env/lib/python3.10/site-packages/llama_index/core/query_engine/retriever_query_engine.py:142, in RetrieverQueryEngine.retrieve(self, query_bundle)
    141 def retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
--> 142     nodes = self._retriever.retrieve(query_bundle)
    143     return self._apply_node_postprocessors(nodes, query_bundle=query_bundle)

File /opt/homebrew/Caskroom/miniforge/base/envs/pytorch_env/lib/python3.10/site-packages/llama_index/core/base/base_retriever.py:229, in BaseRetriever.retrieve(self, str_or_query_bundle)
    224 with self.callback_manager.as_trace("query"):
    225     with self.callback_manager.event(
    226         CBEventType.RETRIEVE,
    227         payload={EventPayload.QUERY_STR: query_bundle.query_str},
    228     ) as retrieve_event:
--> 229         nodes = self._retrieve(query_bundle)
    230         nodes = self._handle_recursive_retrieval(query_bundle, nodes)
    231         retrieve_event.on_end(
    232             payload={EventPayload.NODES: nodes},
    233         )

File /opt/homebrew/Caskroom/miniforge/base/envs/pytorch_env/lib/python3.10/site-packages/llama_index/core/indices/knowledge_graph/retrievers.py:285, in KGTableRetriever._retrieve(self, query_bundle)
    282 # When include_text = True just get the actual content of all the nodes
    283 # (Nodes with actual keyword match, Nodes which are found from the depth search and Nodes founnd from top_k similarity)
    284 if self._include_text:
--> 285     keywords = self._extract_rel_text_keywords(
    286         rel_texts
    287     )  # rel_texts will have all the Triplets retrieved with respect to the Query
    288     nested_node_ids = [
    289         self._index_struct.search_node_by_keyword(keyword)
    290         for keyword in keywords
    291     ]
    292     node_ids = [_id for ids in nested_node_ids for _id in ids]

File /opt/homebrew/Caskroom/miniforge/base/envs/pytorch_env/lib/python3.10/site-packages/llama_index/core/indices/knowledge_graph/retrievers.py:177, in KGTableRetriever._extract_rel_text_keywords(self, rel_texts)
    175     keywords.append(keyword.strip("(\"'"))
    176 # Return the Object as well
--> 177 keyword = rel_text.split(",")[2]
    178 if keyword:
    179     keywords.append(keyword.strip(" ()\"'"))

IndexError: list index out of range
dosubot[bot] commented 6 months ago

Hey @MRX2005nzr, good to see you back here! I hope all is well on your end.

Based on the traceback you provided, the error seems to be occurring in the _extract_rel_text_keywords method of the KGTableRetriever class. This method is trying to split the rel_text string into three parts, but it seems that in some cases, the rel_text string does not have enough commas to be split into three parts, hence the "list index out of range" error.

Here is the problematic code:

# File /opt/homebrew/Caskroom/miniforge/base/envs/pytorch_env/lib/python3.10/site-packages/llama_index/core/indices/knowledge_graph/retrievers.py:177
keyword = rel_text.split(",")[2]

This line is trying to access the third element (index 2) of the list returned by rel_text.split(","). If rel_text does not contain at least two commas, this will result in a "list index out of range" error.
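
For illustration only (the string below is hypothetical, not taken from your data), any rel_text with fewer than two commas reproduces the error:

# Hypothetical rel_text with only one comma, i.e. two parts after splitting
rel_text = "MySumo, models traffic flow"
parts = rel_text.split(",")   # ['MySumo', ' models traffic flow']
keyword = parts[2]            # raises IndexError: list index out of range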

To fix this issue, you could modify your data to ensure that every rel_text string contains at least two commas. Alternatively, you could modify the _extract_rel_text_keywords method to handle rel_text strings that do not contain enough commas. Here is an example of how you could do this:

# File /opt/homebrew/Caskroom/miniforge/base/envs/pytorch_env/lib/python3.10/site-packages/llama_index/core/indices/knowledge_graph/retrievers.py:177
split_rel_text = rel_text.split(",")
if len(split_rel_text) > 2:
    keyword = split_rel_text[2]
else:
    keyword = None  # or some other appropriate value

This code will prevent the "list index out of range" error by checking the length of split_rel_text before trying to access its third element.

As for the maximum length of a question that can be processed by the query_engine.query method, I couldn't find any specific limit in the LlamaIndex codebase. However, it's possible that the length of the question could affect the performance of the query engine, especially if the question is very long. If you're experiencing issues with long questions, you might want to try breaking them up into smaller parts and processing each part separately.
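
For example, here is a rough sketch of that approach (the sub-questions below are purely illustrative, not a recommended decomposition):

# Sketch: run several shorter, more focused queries instead of one long one.
sub_questions = [
    "MySumo traffic-flow modeling workflow",
    "How is a traffic-flow xml file processed?",
]
for question in sub_questions:
    response = query_engine.query(question)
    print(response)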

I hope this helps! If you have any other questions, feel free to ask.

MRX2005nzr commented 6 months ago

@dosu-bot I have tried to solve this problem by adding extra ',' characters, but it didn't work. On the contrary, the code runs successfully when I reduce the length of the question, even though there is only one ',' in the question sentence.

dosubot[bot] commented 6 months ago

Based on the error traceback you provided, it seems like the issue is occurring in the _extract_rel_text_keywords method of the KGTableRetriever class. The error is raised when the method tries to access the third element (index 2) of the list created by splitting rel_text on commas. If rel_text doesn't contain at least three comma-separated elements, this will result in an IndexError.

The rel_text is likely derived from the question you're passing to the query_engine.query method. The method _extract_rel_text_keywords is expecting a certain format of the input string, which seems to be a comma-separated list of elements. If the input string doesn't meet this expectation, it could lead to the error you're experiencing.

Here's a potential solution:

You could modify your question to ensure it contains at least three comma-separated elements. However, this might not be a practical solution if it doesn't align with the actual questions you need to ask in your application.

A more robust solution would be to modify the _extract_rel_text_keywords method to handle cases where rel_text doesn't contain at least three comma-separated elements. This could involve checking the length of the list after splitting and providing a default value for keyword if the list is too short. Here's an example of how you could implement this:

def _extract_rel_text_keywords(self, rel_texts):
    keywords = []
    for rel_text in rel_texts:
        split_text = rel_text.split(",")
        if len(split_text) > 2:
            keyword = split_text[2]
        else:
            keyword = None  # no object part in this rel_text
        if keyword:  # skip rel_texts that have no object part
            keywords.append(keyword.strip(" ()\"'"))
    return keywords

This code will prevent the IndexError by checking the length of split_text before trying to access its third element. If split_text doesn't have a third element, keyword is set to None (or some other default value of your choice) and is then skipped instead of being added to the keyword list.

Please note that this is a workaround to prevent the error from occurring. Depending on the specifics of your application and the nature of the questions you're asking, you might need to adjust this solution to better suit your needs.

As for the issue of the error not occurring when the question length is reduced, it's possible that the shorter question accidentally meets the expected format of the rel_text (i.e., it contains at least three comma-separated elements), thus avoiding the IndexError.

I hope this helps! If you have any further questions or if something is unclear, please don't hesitate to ask.

logan-markewich commented 5 months ago

I'm not able to replicate this. I wonder if it's an issue with how the triplets were added to the knowledge graph.
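
One way to check that, as a rough sketch (this assumes the generic GraphStore.get_rel_map interface; "MySumo" is only a guessed entity name, replace it with one you know exists in the space), is to dump what was actually stored and look for triplets that are missing their object part:

# Sketch: dump the relationships stored for a guessed subject and check
# whether any stored triplet is missing its object part.
rel_map = graph_store.get_rel_map(["MySumo"], depth=1, limit=30)
for subj, rels in rel_map.items():
    print(subj, rels)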

tslmy commented 5 months ago

现在用户输入了一个记录着车流信息的xml文件,请写出MySumo进行车流建模分析的工作流程
("The user has now provided an XML file containing traffic-flow records; please write out the workflow for traffic-flow modeling and analysis with MySumo.")

I don't think that's the kind of query a query engine expects. What you wrote here looks to me more like a request for prose than a topic you want to retrieve data about.

I would probably write queries more like these:

Traffic from Beijing to Shanghai

货车 (trucks) passing through 267国道 between 12/31/2024 and 1/2/2025

(Try both Chinese and English. The underlying system might not be good at Chinese. You can never be sure.)
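
Something along these lines, just as a sketch (the query text below is only an example):

# Sketch: retry with a terse, entity-centric query instead of a full task
# description.
from IPython.display import Markdown, display

response = query_engine.query("MySumo traffic-flow modeling workflow")
display(Markdown(f"<b>{response}</b>"))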