neo4j / neo4j-python-driver

Neo4j Bolt driver for Python
https://neo4j.com/docs/api/python-driver/current/

Result.fetch seems not to release memory #901

Closed aleksanderlech closed 1 year ago

aleksanderlech commented 1 year ago

Bug Report

Hello, I have a use case where I have to traverse the whole database and perform some action on each node. For that I open a long-running transaction that runs a query over all nodes and returns a result cursor. In the example below I use fetch() to get the next 100 records from the query. This runs fine, but memory usage grows into the gigabytes and eventually crashes the script. Is there something I am missing here, or does fetch() not release some buffers when fetching the next batch?

from neo4j import GraphDatabase

neo4j_host = "localhost"
neo4j_port = "7687"
neo4j_user = "neo4j"
neo4j_password = "xxxxxxx"
batch_size = 100

print(f"Connecting to {neo4j_host}")

driver = GraphDatabase.driver(f"bolt://{neo4j_host}:{neo4j_port}", auth=(neo4j_user, neo4j_password), encrypted=False)

with driver.session() as session:
    with session.begin_transaction() as transaction:
        total = transaction.run("MATCH (a) RETURN COUNT(a) AS count").single().data().get('count')
        result = transaction.run("MATCH (a) RETURN a")

        print(f"Total records to process is {total}")

        nodes = result.fetch(batch_size)

        while len(nodes) > 0:
            nodes = result.fetch(batch_size)

driver.close()
print("Done")

My Environment

Python Version: 3.11.2
Driver Version: 5.6.0
Server Version and Edition: 5.3.0 community
Operating System: macOS 13.2.1 (22D68)

robsdedude commented 1 year ago

Hi, and thanks for reaching out. Admittedly, I was unsure what was going on here myself at first. I then loaded a db with dummy data and tried your code. I was able to reproduce it and quickly realized the culprit: you are querying nodes. In the background, the driver tracks all graph types (nodes and relationships) and builds an in-memory graph. This enables very convenient things like getting some_relationship.start_node if that start node was anywhere in the result set. But sure enough, it comes at the price of higher memory requirements.

What you can do to alleviate this: don't return the full node. You could, for example, change

MATCH (a) RETURN a

to something along the lines of

MATCH (a) RETURN a {.*, labels: labels(a)}

This assumes that no node has a property named labels; otherwise it would be overwritten in the output.
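Put together with the batching loop from the report, the workaround could look roughly like the sketch below. It is only an illustration, not code from the driver: the names QUERY, iter_batches and count_nodes are made up here, the explicit "AS a" alias is an addition to the query above, and the driver argument is assumed to be a neo4j.Driver created as in the original script.

```python
from typing import Any, Iterator, List

# Map projection from the reply above: each record carries a plain map
# instead of a Node object. The "AS a" alias is an addition in this sketch.
QUERY = "MATCH (a) RETURN a {.*, labels: labels(a)} AS a"


def iter_batches(result: Any, batch_size: int = 100) -> Iterator[List[Any]]:
    """Yield lists of up to batch_size records until the result is exhausted.

    `result` is assumed to be any object with a fetch(n) method that returns
    a list, such as the Result used in the report (hypothetical helper).
    """
    while True:
        batch = result.fetch(batch_size)
        if not batch:
            return
        yield batch


def count_nodes(driver: Any, batch_size: int = 100) -> int:
    """Stream all nodes in batches and return how many records were seen.

    `driver` is assumed to be a neo4j.Driver; replace the counting with the
    real per-batch work.
    """
    seen = 0
    with driver.session() as session:
        with session.begin_transaction() as tx:
            result = tx.run(QUERY)
            for batch in iter_batches(result, batch_size):
                seen += len(batch)
    return seen
```

Each batch goes out of scope before the next one is fetched, so with the map projection in place the script should only hold about one batch of records at a time.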

aleksanderlech commented 1 year ago

Hello @robsdedude,

After further debugging we found this growing graph variable and were about to report it, but you were faster :) I changed the query following your suggestion and now it's perfect. Thanks for the fast support!