vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.28k stars 590 forks

[BUG-REPORT] The process cannot access the file because it is being used by another process #2119

Closed abf7d closed 2 years ago

abf7d commented 2 years ago

Description

I am working on an application that uses Vaex to access data from a feather file. We create virtual columns in a dataframe that store boolean values, which are used to filter rows of the dataset. Every time a new filter is made, a file is saved to cache the data: we use export_feather to save the filter to a file, drop the virtual column, and then join with the cached file. Here is the relevant part of the code:

    filename = f"filter__{filter_id}.feather"
    df[[f"filter__{filter_id}"]].export_feather(
        str(export_path.joinpath(filename)).replace("\\", "/")
    )

    # Once the file is saved, drop the virtual column and join the cached selection
    df.drop([f"filter__{filter_id}"], inplace=True)
    df.join(vaex.open(export_path.joinpath(filename)), inplace=True)

Later, the application cleans up by deleting the cached files. When we try to delete them with

    os.chmod(file, 0o777)
    os.remove(file)

We get the error PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'data\\my-collection\\.my-app\\filters\\data.feather\\filter__1.feather'

If I remove the df.drop and df.join calls when creating the files, the error doesn't occur and the files are deleted. I tried looking at the vaex source code to see what df.join does, but I'm rather new to Python and nothing jumped out at me. How are the file and dataset handled, and why is the file handle not released? In this context, which process is using the file, and how can I close it so that I can delete the file?

Software information

JovanVeljanoski commented 2 years ago

Would df.close() work (called right before you need to delete the file)?

abf7d commented 2 years ago

Indeed, df.close() worked! Thank you so much!
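For anyone landing here later, the cleanup can be sketched roughly as below. This is only an illustration, not code from the application: the helper name close_and_remove and its parameters are hypothetical, and df is assumed to be the vaex DataFrame that was joined with the cached feather files (the helper only relies on it having a close() method).

```python
import os
from pathlib import Path


def close_and_remove(df, cache_files):
    """Close the DataFrame, then delete its cached filter files.

    Hypothetical helper: `df` is assumed to be a vaex DataFrame (only its
    .close() method is used); `cache_files` is an iterable of file paths.
    """
    # Release the file handles / memory maps held by the DataFrame.
    # The DataFrame should not be used again after this call.
    df.close()

    removed = []
    for path in map(Path, cache_files):
        if path.exists():
            os.chmod(path, 0o777)  # same permission reset as in the original cleanup
            os.remove(path)
            removed.append(path)
    return removed
```

Since the DataFrame is unusable after close(), it has to be re-opened from the source file if it is needed again.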

JovanVeljanoski commented 2 years ago

You're welcome - glad it worked.

A little explanation: vaex memory-maps the file and streams the data from disk as needed, rather than reading it once into memory and being done with it. That means the file stays open for as long as the DataFrame is alive. df.close() severs the link to the actual file on disk, so the OS no longer complains that the file is in use.
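The same effect can be reproduced with nothing but the standard library. The sketch below is not how vaex is implemented internally; it only illustrates the general pattern that a memory mapping keeps a file open until it is closed. The function name read_then_delete and the demo file are made up for the example.

```python
import mmap
import os
import tempfile


def read_then_delete(path):
    """Memory-map a file, read a few bytes, close the mapping, delete the file."""
    with open(path, "rb") as fh:
        # Map the whole file read-only; data is paged in from disk on access.
        mapped = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            first_bytes = mapped[:4]
        finally:
            # Closing the mapping releases the OS-level handle on the file.
            mapped.close()
    # Only now is it safe to delete the file on every platform.
    os.remove(path)
    return first_bytes


# Create a small throwaway file to demonstrate with.
fd, demo_path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as fh:
    fh.write(b"vaex-demo")

header = read_then_delete(demo_path)
```

On Windows, calling os.remove while the mapping is still open typically fails with a PermissionError like the one in this report; on Linux and macOS the removal generally succeeds even with the mapping open, which is why the problem tends to show up only on Windows.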

abf7d commented 2 years ago

Thank you @JovanVeljanoski, that info is helpful. Thanks!