neo4j-contrib / neo4j-apoc-procedures

Awesome Procedures On Cypher for Neo4j - codenamed "apoc"
https://neo4j.com/labs/apoc
Apache License 2.0

apoc.load.csv does not close the file on consumption end #4078

Open Ava-S opened 1 month ago

Ava-S commented 1 month ago

I use Python to preprocess a file and then load it into the database. More specifically, I have defined the following import flow:

  1. Users specify the location of the input file
  2. The code preprocesses the file
  3. The code requests the import directory and moves the file to the import folder of the database
  4. The file is imported using apoc.load.csv
  5. Once the import is done, I tidy up by deleting the file from the import folder.
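For readers who want to follow the flow, steps 3 to 5 can be sketched roughly as below. This is only an illustration, not the reporter's actual code: `stage_import_cleanup` and `run_import` are hypothetical names, and `run_import` stands in for whatever function issues the apoc.load.csv call.

```python
import shutil
from pathlib import Path
from typing import Callable


def stage_import_cleanup(source: Path, import_dir: Path,
                         run_import: Callable[[str], None]) -> None:
    """Move `source` into `import_dir`, run the import, then delete the staged file.

    `run_import` is a hypothetical callback that receives the file name
    (relative to the import folder) and performs the database import.
    """
    staged = import_dir / source.name
    # Step 3: move the preprocessed file into the database's import folder.
    shutil.move(str(source), str(staged))
    try:
        # Step 4: import the file, e.g. by calling apoc.load.csv on staged.name.
        run_import(staged.name)
    finally:
        # Step 5: tidy up. On Windows this raises PermissionError if another
        # process still has the file open.
        staged.unlink()
```

The cleanup sits in a `finally` block so that the staged file is removed even if the import itself fails.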

However, after recently upgrading my database from version 5.9.0 to 5.10.0, I can no longer delete the file from the import folder, because the database still holds it open. The error persists in version 5.17 (last tested).

Expected Behavior (Mandatory)

After a file is imported using apoc.load.csv, the file should be closed once it has been fully consumed, so that other processes can access it.

Actual Behavior (Mandatory)

The issue arises when I attempt to delete the file post-import. I encounter a PermissionError, signaling that the file is still in use by another process. It seems the database is holding onto the file longer than anticipated, causing a conflict with my cleanup operation.

More specifically, I get this error: PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '<ne04j>\\import\\import_file.csv'
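To tell this specific failure apart from other permission problems, the cleanup step could check the Windows error code. The sketch below is not a fix for the underlying issue; `try_delete` and its `remove` parameter are hypothetical names added for illustration, and the code assumes only that the exception carries the `winerror` value 32 shown in the message above.

```python
import os

WIN_SHARING_VIOLATION = 32  # the WinError number from the error message


def try_delete(path, remove=os.remove) -> bool:
    """Return True if `path` was deleted, False if another process holds it open.

    `remove` is a hypothetical hook (defaulting to os.remove) so the
    behaviour can be exercised without a real lock on the file.
    """
    try:
        remove(path)
        return True
    except PermissionError as exc:
        # The `winerror` attribute only exists on Windows; any other
        # permission problem is re-raised unchanged.
        if getattr(exc, "winerror", None) == WIN_SHARING_VIOLATION:
            return False
        raise
```

A caller could log a `False` result and leave the file in place. Whether the database ever releases the handle without a restart is not established in this report.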

How to Reproduce the Problem

Simple Dataset (where it's possible)

The specific dataset does not matter; it happens with any dataset I try to import.

Python Code

This is the Python code I use to import and delete the file.

import os
from pathlib import Path

# Import the file in batches via apoc.periodic.iterate
def import_file(tx):
    result = tx.run('''CALL apoc.periodic.iterate('
                        CALL apoc.load.csv("import_file.csv") yield map as row return row',
                        'CREATE (record:Record)
                        SET record += row'
                    , {batchSize:10000, parallel:true, retries: 1});''')
    result.consume()  # exhaust the result before the transaction function returns

# Excerpt from a class: self.driver is the neo4j driver instance
with self.driver.session(database="neo4j") as session:
    session.execute_write(import_file)

# Delete the file from the import directory
path = Path(self.get_import_directory(), "import_file.csv")
os.remove(path)

Steps (Mandatory)

  1. Import data using apoc.load.csv with the Python neo4j driver
  2. Delete the file directly afterwards using Python

Specifications (Mandatory)

Currently used versions

Versions

vga91 commented 1 week ago

The error also seems to occur without apoc.periodic.iterate, and even when running apoc.load.csv directly in Neo4j Browser/Desktop and then trying to delete the file via File Explorer, with no Python code involved.

It could be an error in Neo4j itself, since the apoc.load.csv code has not changed.

I opened an issue on the neo4j kernel repository, to investigate both sides: https://github.com/neo4j/neo4j/issues/13480