run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Extremely long time initializing Neo4jPropertyGraphStore for larger graphs #16204

Open hypernovas opened 2 days ago

hypernovas commented 2 days ago

Bug Description

It takes about 14 minutes to initialize the graph store with 3558 entities. I suspect this is because refresh_schema() does not handle large graphs well, possibly because it does not use async.

Below is the output of timestamped log statements I added to the library code. This is far too slow for anything beyond experiments. Could you help with this?

2024-09-24 17:22:40.552042 Step 1: Sanitize query output
2024-09-24 17:22:40.552075 Step 2: Enhanced schema
2024-09-24 17:22:40.552079 Step 3: Create driver
2024-09-24 17:22:40.552180 Step 4: Create async driver
2024-09-24 17:22:40.552232 Step 5: Set database
2024-09-24 17:22:40.552236 Step 6: Create structured schema
2024-09-24 17:22:40.552238 Step 7: Create supports vector index
2024-09-24 17:36:02.882747 Step 8: Create index
2024-09-24 17:36:02.888786 Step 9: Verify version
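
To make the bottleneck in the log above explicit, here is a minimal sketch that parses the timestamps and reports the largest interval between consecutive log lines. The helper names (`parse_line`, `step_gaps`) are made up for illustration, and the sketch only measures time between log lines; it does not say which call inside the library is responsible.

```python
from datetime import datetime

# Log lines copied verbatim from the issue.
LOG = """\
2024-09-24 17:22:40.552042 Step 1: Sanitize query output
2024-09-24 17:22:40.552075 Step 2: Enhanced schema
2024-09-24 17:22:40.552079 Step 3: Create driver
2024-09-24 17:22:40.552180 Step 4: Create async driver
2024-09-24 17:22:40.552232 Step 5: Set database
2024-09-24 17:22:40.552236 Step 6: Create structured schema
2024-09-24 17:22:40.552238 Step 7: Create supports vector index
2024-09-24 17:36:02.882747 Step 8: Create index
2024-09-24 17:36:02.888786 Step 9: Verify version
"""


def parse_line(line):
    """Split one log line into (timestamp, message)."""
    date, clock, message = line.split(" ", 2)
    stamp = datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S.%f")
    return stamp, message


def step_gaps(log):
    """Return (earlier message, later message, seconds between them) for consecutive lines."""
    entries = [parse_line(line) for line in log.strip().splitlines()]
    return [
        (msg0, msg1, (t1 - t0).total_seconds())
        for (t0, msg0), (t1, msg1) in zip(entries, entries[1:])
    ]


gaps = step_gaps(LOG)
slowest = max(gaps, key=lambda gap: gap[2])
print(f"{slowest[0]} -> {slowest[1]}: {slowest[2]:.1f} s")
```

Nearly all of the elapsed time falls between the "Step 7" and "Step 8" lines (roughly 13 minutes 22 seconds); every other interval is well under a second.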

Version

llama-index-graph-stores-neo4j==0.3.2, llama-index==0.11.11

Steps to Reproduce

  1. Use a local Neo4j Desktop server.
  2. Ingest more than 3000 entities with proper relationships (and reasonable text descriptions).
  3. Load a Neo4j graph_store:
    from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

    graph = Neo4jPropertyGraphStore(
        username="neo4j", password="xxxxxx", url="bolt://localhost:7687", database="neo4j"
    )
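
To put a number on step 3 without editing library code, one option is to wrap the constructor call in a small timer. This is a sketch: `timed_call` is a hypothetical helper, and `slow_init` is a stand-in that only sleeps so the example runs without a Neo4j server.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def slow_init(url, database):
    # Hypothetical stand-in for an expensive constructor; it does not contact any database.
    time.sleep(0.05)
    return {"url": url, "database": database}


store, seconds = timed_call(slow_init, url="bolt://localhost:7687", database="neo4j")
print(f"initialization took {seconds:.3f} s")
```

Against a real server, the same helper could be given `Neo4jPropertyGraphStore` and the keyword arguments from step 3 in place of `slow_init`.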

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 2 days ago

The long initialization time for Neo4jPropertyGraphStore with larger graphs is likely caused by the refresh_schema() method, which runs several structured queries to gather schema information and does not run them asynchronously. These queries can be slow on large datasets.

The Neo4jPropertyGraphStore class does have an asynchronous driver (_async_driver) and several asynchronous methods that could be leveraged to improve the performance of refresh_schema(). Implementing asynchronous queries could potentially reduce the initialization time by allowing multiple queries to be executed concurrently [1][2].

Additionally, the NebulaPropertyGraphStore class in LlamaIndex handles schema refresh and initialization more efficiently and could serve as a model for adapting Neo4jPropertyGraphStore [3].

Consider modifying the refresh_schema() method to use asynchronous queries to improve performance.
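
As an illustration of the concurrency idea suggested above, here is a minimal sketch using plain `asyncio`. It is not the library's implementation: `run_schema_query` is a hypothetical stand-in that sleeps instead of querying Neo4j, and the query names and delays are invented for the example.

```python
import asyncio
import time

# Hypothetical (name, simulated latency in seconds) pairs standing in for schema queries.
QUERIES = [
    ("node_properties", 0.2),
    ("relationship_properties", 0.2),
    ("relationships", 0.2),
]


async def run_schema_query(name, delay):
    # Stand-in for one schema query: waits instead of contacting a database.
    await asyncio.sleep(delay)
    return f"{name}: done"


async def refresh_sequential(queries):
    """Await each query in turn, so the latencies add up."""
    return [await run_schema_query(name, delay) for name, delay in queries]


async def refresh_concurrent(queries):
    """Start all queries at once; total time is close to the slowest single query."""
    return await asyncio.gather(*(run_schema_query(name, delay) for name, delay in queries))


def timed(coro_fn):
    """Run an async function over QUERIES and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(QUERIES))
    return list(results), time.perf_counter() - start


seq_results, seq_time = timed(refresh_sequential)
con_results, con_time = timed(refresh_concurrent)
print(f"sequential: {seq_time:.2f} s, concurrent: {con_time:.2f} s")
```

Running queries concurrently only helps when the time is spread across several independent queries. If a single query accounts for most of the delay, that query itself would need to be made cheaper.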
