run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: #11942

Closed codyseally closed 4 months ago

codyseally commented 8 months ago

Bug Description

When trying to ingest PDF documents and build a knowledge graph using NebulaGraph as my graph store, I get an error related to a limit number for scanning vertices.

# Assumes storage_context, space_name, edge_types, rel_prop_names, tags,
# llm, and embed_model have already been configured for NebulaGraph.
from llama_index.core import KnowledgeGraphIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()

kg_index = KnowledgeGraphIndex.from_documents(
    documents=documents,
    storage_context=storage_context,
    max_triplets_per_chunk=2,
    space_name=space_name,
    edge_types=edge_types,
    llm=llm,
    embed_model=embed_model,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
    show_progress=True,
)

kg_index.get_networkx_graph()

This returns:

Query failed. Query: WITH map{`true`: '-[', `false`: '<-['} AS arrow_l,     map{`true`: ']->', `false`: ']-'} AS arrow_r,     map{`relationship`: "relationship"} AS edge_type_map MATCH p=(start)-[e:`relationship`*..1]-()   WHERE id(start) IN $subjs WITH start, id(start) AS vid, nodes(p) AS nodes, e AS rels,  length(p) AS rel_count, arrow_l, arrow_r, edge_type_map WITH   REDUCE(s = vid + '{', key IN [key_ in ['', 'name']     WHERE properties(start)[key_] IS NOT NULL]  | s + key + ': ' +       COALESCE(TOSTRING(properties(start)[key]), 'null') + ', ')      + '}'    AS subj,  [item in [i IN RANGE(0, rel_count - 1)|[nodes[i], nodes[i + 1],      rels[i], typeid(rels[i]) > 0, type(rels[i]) ]] | [    arrow_l[tostring(item[3])] +      item[4] + ':' +      REDUCE(s = '{', key IN SPLIT(edge_type_map[item[4]], ',') |         s + key + ': ' + COALESCE(TOSTRING(properties(item[2])[key]),        'null') + ', ') + '}'      +    arrow_r[tostring(item[3])],    REDUCE(s = id(item[1]) + '{', key IN [key_ in         ['', 'name'] WHERE properties(item[1])[key_]         IS NOT NULL]  | s + key + ': ' +         COALESCE(TOSTRING(properties(item[1])[key]), 'null') + ', ')        + '}'    ]  ] AS rels WITH   REPLACE(subj, ', }', '}') AS subj,  REDUCE(acc = collect(NULL), l in rels | acc + l) AS flattened_rels RETURN   subj,  REPLACE(REDUCE(acc = subj, l in flattened_rels | acc + ' ' + l),     ', }', '}')     AS flattened_rels  LIMIT 100, Param: {'subjs': Value(
    lVal=NList(
         values=[Value(
             sVal='Education'),
         Value(
             sVal='Teaching'),
         Value(
             sVal='Computers'),
         Value(
             sVal='Way information is gathered'),
         Value(
             sVal='Schools'),
         Value(
             sVal='Clear plan'),
         Value(
             sVal='School'),
         Value(
             sVal='Develop it plan'),
         Value(
             sVal='It'),
         Value(
             sVal='To be used'),
         Value(
             sVal='Necessary for'),
         Value(
...
         Value(
             sVal='Social media'),
         Value(
             sVal='Millennials and gen z')]))}Error message: Scan vertices or edges need to specify a limit number, or limit number can not push down.

The input is just a PDF document with a 3-page sample essay. If I try to perform a query using

from IPython.display import Markdown, display

hybrid_query_engine = kg_index.as_query_engine(
    include_text=True,
    llm=llm,
    response_mode="tree_summarize",
    embedding_mode="hybrid",
    similarity_top_k=3,
    explore_global_knowledge=True,
)

query_text = "What is education?"
response = hybrid_query_engine.query(query_text)
display(Markdown(f"<b>{response}</b>"))

I get the same error as above, followed by Query failed.

Version

llama-index==0.10.15

Steps to Reproduce

Try to run the provided code.

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 8 months ago

Hey @codyseally, great to see you back! Hope you're doing well. 🚀

The error message "Scan vertices or edges need to specify a limit number, or limit number can not push down" that you're encountering when ingesting PDF documents and building a knowledge graph with NebulaGraph and LlamaIndex 0.10.15 does not originate in the Python code from the llama-index-core/llama_index/core/indices/knowledge_graph/base.py file. It is raised by the underlying graph database operations, i.e. by NebulaGraph itself, rather than by the LlamaIndex library.

NebulaGraph requires specifying a limit when scanning vertices or edges to prevent overwhelming the system with too much data at once. This is a common requirement for graph databases to ensure performance and resource utilization are kept within reasonable bounds. The error suggests that a query operation is being performed without specifying this limit, or the system is unable to apply the limit as expected.
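As a minimal, untested sketch of that generic advice (the helper name and the naive string check are illustrative, not part of LlamaIndex or the NebulaGraph client), a guard like the following can be applied to nGQL statements before they are sent, appending a LIMIT clause when one is missing:

```python
import re


def ensure_limit(ngql: str, default_limit: int = 100) -> str:
    """Append a LIMIT clause to an nGQL statement if it lacks one.

    This is a naive textual check: it only looks for a trailing
    LIMIT keyword and does not actually parse the query.
    """
    if re.search(r"\bLIMIT\s+\d+\s*$", ngql.strip(), re.IGNORECASE):
        return ngql
    # Strip any trailing semicolon before appending the clause.
    return f"{ngql.rstrip().rstrip(';')} LIMIT {default_limit}"


print(ensure_limit("MATCH (v) RETURN v"))
# -> MATCH (v) RETURN v LIMIT 100
```

Note, however, that the failing query in the traceback above already ends in LIMIT 100, which suggests the limit could not be pushed down rather than that it was absent.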

However, the provided Python code does not include direct interactions with NebulaGraph's API or query language (nGQL). The code is primarily focused on the construction and manipulation of a knowledge graph index within the LlamaIndex framework, including the extraction of triplets from text and the management of these triplets within an in-memory graph store (SimpleGraphStore).
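To illustrate the distinction, an in-memory triplet store in the spirit of SimpleGraphStore never issues database queries at all. The toy class below is a simplified sketch for illustration only, not the actual llama-index implementation:

```python
from collections import defaultdict


class TinyTripletStore:
    """Toy in-memory graph store: maps each subject to its (relation, object) pairs."""

    def __init__(self) -> None:
        self._rel_map: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def upsert_triplet(self, subj: str, rel: str, obj: str) -> None:
        # Deduplicate so repeated extraction of the same triplet is a no-op.
        if (rel, obj) not in self._rel_map[subj]:
            self._rel_map[subj].append((rel, obj))

    def get(self, subj: str) -> list[tuple[str, str]]:
        return list(self._rel_map.get(subj, []))


store = TinyTripletStore()
store.upsert_triplet("Education", "requires", "Clear plan")
store.upsert_triplet("Education", "uses", "Computers")
print(store.get("Education"))
# -> [('requires', 'Clear plan'), ('uses', 'Computers')]
```

Because nothing here touches NebulaGraph, a scan-limit error can only come from the layer that translates index operations into nGQL.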

To resolve the error, you would need to review the parts of your system that interact directly with NebulaGraph. Specifically, look for any queries that scan vertices or edges and ensure they include a LIMIT clause with a reasonable number. If you're using a library or framework that abstracts away the direct nGQL queries, check its documentation or source code to understand how it expects limits to be specified and ensure your usage complies with these expectations.

In summary, the solution involves reviewing and adjusting the NebulaGraph query operations to include limits as required, which is not directly addressed by the provided Python code snippet.

Regarding the changes made to the LlamaIndex codebase between version 0.10.15 and the latest version 0.10.19, I'm unable to provide specific details as I don't have access to the repository's change logs or version history. It's recommended to check the repository's release notes or commit history for this information.

As for the role of NebulaGraph in the LlamaIndex framework, it serves as the backbone for a versatile query engine pack, enabling various modes of data retrieval and interaction with knowledge graphs and vector indexes. The NebulaGraphQueryEnginePack class, as defined in the provided context, leverages NebulaGraph to support multiple query engine types, each catering to different retrieval needs. You can find more details about these modes in the NebulaGraphQueryEnginePack class documentation.

I hope this information helps. If you have further questions or need more clarification, feel free to ask.

Sources

#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.