kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License
1.07k stars 77 forks source link

Resource contention on Windows using the Python API #3399

Open ZmeiGorynych opened 2 weeks ago

ZmeiGorynych commented 2 weeks ago

When I try to insert nodes and relations in a loop, like below, sooner or later I get the following exception: (<class 'RuntimeError'>, RuntimeError('remove: The process cannot access the file because it is being used by another process.: "C:\\Users\\Egor\\Dropbox\\Code\\motleycrew\\examples\\bug_db\\nodes.statistics_and_deleted.ids.wal"'), <traceback object at 0x000001E21B6AB280>) This only happens on Windows (at least not on a Mac), but on Windows it always does :(. Sleeping between inserts makes the problem less likely, but it still appears sometimes.

Running kuzu==0.3.2 installed with pip, on Python 3.11 on Window.s

import shutil
from pathlib import Path
from time import sleep
import os
import shutil

import kuzu

WORKING_DIR = Path(__file__).parent
DB_PATH = WORKING_DIR / "bug_db"
node_table_name = "entity"
rel_table_name = "links"

if os.path.isdir(DB_PATH):
    shutil.rmtree(DB_PATH)

database = kuzu.Database(DB_PATH)
connection = kuzu.Connection(database)
node_tables = connection._get_node_table_names()

if node_table_name not in node_tables:
    connection.execute("CREATE NODE TABLE %s (ID STRING, PRIMARY KEY(ID))" % node_table_name)
rel_tables = connection._get_rel_table_names()
rel_tables = [rel_table["name"] for rel_table in rel_tables]
if rel_table_name not in rel_tables:
    connection.execute(
        "CREATE REL TABLE {} (FROM {} TO {}, predicate STRING)".format(
            rel_table_name, node_table_name, node_table_name
        )
    )

def create_relation(subj: str, obj: str, rel: str) -> None:
    connection.execute(
        (
            "MATCH (n1:{}), (n2:{}) WHERE n1.ID = $subj AND n2.ID = $obj "
            "CREATE (n1)-[r:{} {{predicate: $pred}}]->(n2)"
        ).format(node_table_name, node_table_name, rel_table_name),
        {"subj": subj, "obj": obj, "pred": rel},
    )

def create_entity(entity: str) -> None:
    connection.execute(
        "CREATE (n:%s {ID: $entity})" % node_table_name,
        {"entity": entity},
    )

create_entity("entity0")

sleep_time = 0
for i in range(10):
    print(i + 1)
    create_entity(f"entity{i+1}")
    sleep(sleep_time)
    create_relation("entity1", f"entity{i+1}", "rel1")
    sleep(sleep_time)
ZmeiGorynych commented 2 weeks ago

Just noticed that this doesn't happen if I make the relation name unique, eg f"rel{i+1}" instead of "rel1". Do the relation names have to be unique even when they connect different nodes?

prrao87 commented 2 weeks ago

Hi @ZmeiGorynych, relationship tables reference the underlying node table's primary keys in the FROM and TO columns (you can only have one relationship between any given pair of nodes based on their primary keys).

Just clarifying the intent behind this snippet, which doesn't seem right (or safe) to me:

sleep_time = 0
for i in range(10):
    print(i + 1)
    create_entity(f"entity{i+1}")
    sleep(sleep_time)
    create_relation("entity1", f"entity{i+1}", "rel1")
    sleep(sleep_time)

When you just create entity1 within the for loop and then immediately reference it to create the relationship, the transaction may not have been committed yet (nor the primary key index created/updated safely for the relationship table to reference). From an I/O perspective, you're much better off creating the nodes beforehand, so that the node table's hash index can get up to date and all the transactions are committed safely. Then you would be able to safely reference these primary keys in a subsequent loop to update the relationships. Maybe @ray6080 might be able to offer more advice if I've missed anything here.

ZmeiGorynych commented 2 weeks ago

Thanks a lot for the quick answer! However, I would challenge it on several counts: firstly, as I said the exception vanishes when I replace the line create_relation("entity1", f"entity{i+1}", f"rel") with create_relation("entity1", f"entity{i+1}", f"rel{i+1}"), and it doesn't seem like that would happen if your explanation was correct.

Secondly, the above is a very natural calling pattern, think of it as having a function add_child that is called several times in a row, and each time it first creates a child node and then a relationship between it and entity1.

Finally, I find it very alarming that you appear to say that a collision between subsequent execute methods if they are not called in a certain magic order is reasonable behavior for kuzu. Is there no way to make sure each call of the execute method returns only after all the commits have been made, all the indices updated, etc? I don't mind waiting a bit longer on each call, but having random errors like this pop up on a logically sound sequence of calls might be a deal breaker for us using kuzu if there is no fix or workaround :(

semihsalihoglu-uw commented 2 weeks ago

Thanks for reporting the bug. Just to weigh in: I don't see anything wrong with the query. Each of your calls to connection.execute should auto-commit before returning, so you should not even need to sleep in between these. This looks like a bug with our management of the nodes.statistics_and_deleted.ids.wal file. Let's have @ray6080 look into this bug and fix it.

ZmeiGorynych commented 2 weeks ago

Thanks @semihsalihoglu-uw! Relieved to have you agree that this is indeed a bug :) Probably not caught yet because your tests are likely on a Linux, and Windows can be a lot more anal about locking files. Hopefully the fact that the bug disappears if the relation names are all distinct will make it easier to locate.

ZmeiGorynych commented 1 week ago

Just got a very similar error in a different context: (<class 'RuntimeError'>, RuntimeError('remove: The process cannot access the file because it is being used by another process.: "C:\\Users\\Egor\\Dropbox\\Code\\motleycrew\\examples\\research_agent\\research_db\\catalog.kz.wal"'), <traceback object at 0x00000A8CDAEA8D80>) while executing a very basic query 'CREATE REL TABLE is_subquestion (FROM Question TO Question)'

Is intermittent, sometimes it arises when I call the above query and sometimes not in exactly the same code.

Is there any ETA on resolving the above and hopefully also this one? :)

Thanks a lot!

ray6080 commented 1 week ago

Hey @ZmeiGorynych , we will take a look into it today and get back to u on the ETA. prioritizing this now.

benjaminwinger commented 1 week ago

I haven't been able to reproduce this, but it could just be from another program accessing the wal file when it is being removed after committing. Is it possible that it's caused by Dropbox accessing the files? Have you tried disabling anti-virus scanning for the database directory?

ZmeiGorynych commented 1 week ago

Ah, nice catch! After moving the disk location to a non-Dropbox directory, the error disappeared (or at least hasn't appeared after a dozen or so runs). Still, this sounds like something others might encounter - would it be worth waiting for (a second?) and retrying if this error occurs, instead of failing right away?

benjaminwinger commented 1 day ago

That might work, though I think a more reliable alternative would be to avoid removing the WAL files at all and to just empty them instead.