run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
35.29k stars 4.96k forks source link

[Bug]: Inefficient Neo4j Driver Initialization and Missing Connection Closure in Neo4jPropertyGraphStore #14907

Open tareksanger opened 1 month ago

tareksanger commented 1 month ago

Bug Description

The current implementation of the Neo4jPropertyGraphStore class in Llama Index initializes the Neo4j driver within the class constructor. According to the Neo4j documentation, driver objects are "immutable, thread-safe, and expensive to create. Therefore, applications should create only one instance of the driver and pass it around, rather than initializing it multiple times."

Furthermore, the class does not properly close the driver connections, which can lead to resource leaks and other issues over time.

Relevant Code:


class Neo4jPropertyGraphStore(PropertyGraphStore):

    ...

    def __init__(
        self,
        username: str,
        password: str,
        url: str,
        database: Optional[str] = "neo4j",
        refresh_schema: bool = True,
        sanitize_query_output: bool = True,
        enhanced_schema: bool = False,
        **neo4j_kwargs: Any,
    ) -> None:
        self.sanitize_query_output = sanitize_query_output
        self.enhanced_schema = enhanced_schema
        self._driver = neo4j.GraphDatabase.driver(
            url, auth=(username, password), **neo4j_kwargs
        )
        self._async_driver = neo4j.AsyncGraphDatabase.driver(
            url,
            auth=(username, password),
            **neo4j_kwargs,
        )
        self._database = database
        self.structured_schema = {}
        if refresh_schema:
            self.refresh_schema()

Problems:

Suggested Solution:

Proposed Code Changes:

  1. Modify the constructor to accept the driver as an argument:

def __init__(
    self,
    driver: neo4j.Driver,
    async_driver: neo4j.AsyncDriver,
    database: Optional[str] = "neo4j",
    refresh_schema: bool = True,
    sanitize_query_output: bool = True,
    enhanced_schema: bool = False,
) -> None:
    self.sanitize_query_output = sanitize_query_output
    self.enhanced_schema = enhanced_schema
    self._driver = driver
    self._async_driver = async_driver
    self._database = database
    self.structured_schema = {}
    if refresh_schema:
        self.refresh_schema()

Example Usage:

from neo4j import GraphDatabase
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

# Create a single driver instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))
async_driver = neo4j.AsyncGraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))

# Pass the driver instance to the Neo4jPropertyGraphStore
graph_store = Neo4jPropertyGraphStore(
    driver=driver,
    async_driver=async_driver,
    database="neo4j"
)

# Ensure to close the driver when the application is shutting down
driver.close()
async_driver.close()

Conclusion:

By making these changes, the Neo4jPropertyGraphStore class will align with Neo4j’s best practices for driver usage, improving efficiency and resource management. The responsibility of managing the driver lifecycle, including its closure, should be left to the developer. I hope you consider these suggestions for future updates.

Version

0.10.57

Steps to Reproduce

  1. Instantiate the Neo4jPropertyGraphStore class multiple times in an application.
  2. Observe the multiple instances of the Neo4j driver being created.
  3. Note the absence of any method to close the driver connections properly.

Relevant Logs/Tracbacks

No response

dosubot[bot] commented 1 month ago

@tareksanger مرحبًا! I'm here to assist you with any issues you may have. I can help you solve bugs, answer questions, and even guide you on how to contribute. Feel free to ask anything while waiting for a human maintainer to address your concern.

To address the issues of inefficient Neo4j driver initialization and missing connection closure in the Neo4jPropertyGraphStore class, you can modify the class to accept an already initialized Neo4j driver as an argument. This approach aligns with Neo4j’s best practices and ensures proper resource management.

Here is the proposed modification to the Neo4jPropertyGraphStore class:

Modified Neo4jPropertyGraphStore Class

class Neo4jPropertyGraphStore(PropertyGraphStore):
    def __init__(
        self,
        driver: neo4j.Driver,
        async_driver: neo4j.AsyncDriver,
        database: Optional[str] = "neo4j",
        refresh_schema: bool = True,
        sanitize_query_output: bool = True,
        enhanced_schema: bool = False,
    ) -> None:
        self.sanitize_query_output = sanitize_query_output
        self.enhanced_schema = enhanced_schema
        self._driver = driver
        self._async_driver = async_driver
        self._database = database
        self.structured_schema = {}
        if refresh_schema:
            self.refresh_schema()

    @property
    def client(self):
        return self._driver

    def refresh_schema(self) -> None:
        """Refresh the schema."""
        node_properties = [el["output"] for el in self.query(node_properties_query)]
        rel_properties = [el["output"] for el in self.query(rel_properties_query)]
        relationships = [el["output"] for el in self.query(rel_query)]

        self.structured_schema = {
            "node_props": {el["labels"]: el["properties"] for el in node_properties},
            "rel_props": {el["type"]: el["properties"] for el in rel_properties},
            "relationships": relationships,
        }

        # Format node properties
        formatted_node_props = []
        for el in node_properties:
            props_str = ", ".join(
                [f"{prop['property']}: {prop['type']}" for prop in el["properties"]]
            )
            formatted_node_props.append(f"{el['labels']} {{{props_str}}}")

        # Format relationship properties
        formatted_rel_props = []
        for el in rel_properties:
            props_str = ", ".join(
                [f"{prop['property']}: {prop['type']}" for prop in el["properties"]]
            )
            formatted_rel_props.append(f"{el['type']} {{{props_str}}}")

        # Format relationships
        formatted_rels = [
            f"(:{el['start']})-[:{el['type']}]->(:{el['end']})" for el in relationships
        ]

        self.schema = "\n".join(
            [
                "Node properties are the following:",
                ",".join(formatted_node_props),
                "Relationship properties are the following:",
                ",".join(formatted_rel_props),
                "The relationships are the following:",
                ",".join(formatted_rels),
            ]
        )

    def query(self, query: str, param_map: Optional[Dict[str, Any]] = {}) -> Any:
        with self._driver.session(database=self._database) as session:
            result = session.run(query, param_map)
            return [d.data() for d in result]

Example Usage

from neo4j import GraphDatabase
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

# Create a single driver instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))
async_driver = neo4j.AsyncGraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))

# Pass the driver instance to the Neo4jPropertyGraphStore
graph_store = Neo4jPropertyGraphStore(
    driver=driver,
    async_driver=async_driver,
    database="neo4j"
)

# Ensure to close the driver when the application is shutting down
driver.close()
async_driver.close()

Conclusion

By making these changes, the Neo4jPropertyGraphStore class will align with Neo4j’s best practices for driver usage, improving efficiency and resource management. The responsibility of managing the driver lifecycle, including its closure, should be left to the developer [1].

To continue talking to Dosu, mention @dosu.

logan-markewich commented 1 month ago

Feel free to open a pr! 👍