pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Deleting a recently opened netCDF4 file #9410

Open abinashmk opened 2 weeks ago

abinashmk commented 2 weeks ago

As part of my task, I have to download, read, and process netCDF files, and then delete them once a certain number have been read, because of storage limitations. But even after closing the files manually or using a context manager:

with xarray.open_dataset(filePath) as ds:
    ...  # processing code
os.remove(filePath)

OR

ds = xarray.open_dataset(filePath)
# processing code
ds.close()
os.remove(filePath)

I still get:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process

I did refer to the previously reported issues #1629 and #2887, but neither using a context manager nor changing the engine through which the netCDF file is read was of any help. Is there any way to work around this?
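
For reference, one workaround that comes up in those issues is to force the data fully into memory before the file is closed, so that no lazy array keeps a handle on it. A minimal sketch of that pattern, assuming the dataset fits in RAM:

import os
import xarray

with xarray.open_dataset(filePath) as ds:
    ds.load()  # pull all values into memory; nothing lazy references the file afterwards
    # processing code
os.remove(filePath)

(Whether this releases the lock reliably on Windows, I am not sure.)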

welcome[bot] commented 2 weeks ago

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

max-sixty commented 2 weeks ago

Could you make an MCVE to copy & paste, using the context manager?

abinashmk commented 2 weeks ago

Do you mean copying the contents of the file or the file itself?

max-sixty commented 2 weeks ago

The file should be created inline.

Thanks!

abinashmk commented 2 weeks ago

I am a bit lost here. What I am trying to do does not seem related to the creation of the file. There are two dimensions in the dataset, and I am trying to slice a portion from ds, as in the code below, after which I have no use for the original file; I need to delete it because it is big. An MCVE of what I did would look like:

import xarray
import os

with xarray.open_dataset(filePath) as ds:
    cropped_ds = ds.sel(x=slice(x1, x2), y=slice(y1, y2))  # x and y are the dimensions in the dataset

os.remove(filePath)

Assuming the problem came from the processing in between, I replaced it with just a print statement:

import xarray
import os

with xarray.open_dataset(filePath) as ds:
    print(ds)

os.remove(filePath)

However, the problem persisted. I hope this is the information you asked for in your comment; please tell me if you need anything else.

max-sixty commented 2 weeks ago

Sorry if I'm being unclear. Have a look at the docs for an MCVE linked in the issue template. The example should be copy-pasteable into a fresh Python prompt.

keewis commented 2 weeks ago

The issue is that we don't have access to your file (nor should we); instead, what we're looking for is a dummy dataset that you create inline, save to disk, and then use to reproduce your issue. For example:

import numpy as np
import xarray as xr

filepath = ...
ds = xr.Dataset(
    {"a": (["x", "y"], np.ones(shape=(10, 12), dtype="float64"))},
    coords={"x": range(10), "y": range(12)},
)
ds.to_netcdf(filepath)

...  # code to reproduce your issue

(you might have to adapt the dummy dataset to actually reproduce your issue; this is just an example)
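
One possible adaptation, if the difference lies in how the real file is encoded on disk: write the dummy variable with a packed and compressed encoding. The encoding values below are made up; ncdump -h on the real file would show the actual ones.

import numpy as np
import xarray as xr

filepath = "dummy.nc"
ds = xr.Dataset(
    {"a": (["x", "y"], np.ones(shape=(10, 12), dtype="float64"))},
    coords={"x": range(10), "y": range(12)},
)
# hypothetical packed/compressed encoding to mimic the source file
ds.to_netcdf(
    filepath,
    encoding={"a": {"dtype": "int16", "scale_factor": 0.01, "_FillValue": -9999, "zlib": True}},
)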

abinashmk commented 2 weeks ago

So, I was trying to write the MCVE for the issue I was facing. The code looks something like this:

import xarray as xr
import numpy as np
import os

# Create latitude and longitude arrays
lat = np.arange(-90, 90, 0.01)
lon = np.arange(-180, 180, 0.01)

# Create a 2D array for temperature, here using a simple example like a sine function for variation
temperature = np.sin(np.sqrt(lat[:, np.newaxis]**2 + lon[np.newaxis, :]**2))

# Create an xarray Dataset
ds = xr.Dataset(
    {
        "TEMPERATURE": (["lat", "lon"], temperature)
    },
    coords={
        "lat": lat,
        "lon": lon
    }
)

# Save the created dataset to disk
ds.to_netcdf("sample.nc")

with xr.open_dataset("sample.nc") as ds:
    cropped_ds = ds.sel(lon=slice(-95, -94), lat=slice(30, 28))
os.remove("sample.nc")

I can delete that file. But when I try to do the same with the data I am actually working on, it throws an error. Thus, I am adding the link to the file, which is open-source data downloaded from Copernicus Land Services. The following is the code I used that produced the error:

filePath=r"c_gls_LAI300-RT0_201712310000_GLOBE_PROBAV_V1.0.1.nc"
with xr.open_dataset(filePath) as ds:
        print(ds)
os.remove("c_gls_LAI300-RT0_201712310000_GLOBE_PROBAV_V1.0.1.nc")

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'c_gls_LAI300-RT0_201712310000_GLOBE_PROBAV_V1.0.1.nc'

Some things that I have found:

  1. The file is about 1.5 GB, but if I try to write it out again, I get a MemoryError about needing roughly 5.3 GiB (some arithmetic on these numbers is sketched after this list):
    filePath = r"c_gls_LAI300-RT0_201712310000_GLOBE_PROBAV_V1.0.1.nc"
    with xr.open_dataset(filePath, engine="netcdf4") as ds:
        print(ds)
    ds.to_netcdf("sample2.nc")

MemoryError: Unable to allocate 5.30 GiB for an array with shape (47040, 120960) and data type bool

The dataset I created at the very beginning also requires about 5 GB of space, and that code executed without any issues. If I don't specify the engine, the error says about 22 GB is required.

  2. I can process the data, save it to a different file, and delete the newly saved file:
    filePath = r"c_gls_LAI300-RT0_201712310000_GLOBE_PROBAV_V1.0.1.nc"
    with xr.open_dataset(filePath) as ds:
        cropped_ds = ds.sel(lon=slice(-95, -94), lat=slice(30, 28))
    cropped_ds.to_netcdf("sample3.nc")
    with xr.open_dataset("sample3.nc") as ds:
        print(ds)
    os.remove("sample3.nc")

Do tell me if you require further info.

max-sixty commented 2 weeks ago

I can delete the file. But when I try to do the same for the data I am working on, it throws an error.

That is quite surprising!

Without some repro that doesn't involve downloading 1.5 GB of data, this is unlikely to get much traction.

Does making a smaller-but-not-tiny file, say 150 MB, trigger the error?
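
Something like this would produce one (a sketch; the sizes and names are made up): a 4400 x 4400 float64 array is roughly 150 MB uncompressed.

import numpy as np
import xarray as xr

# hypothetical ~150 MB test file: 4400 * 4400 * 8 bytes is about 155 MB
lat = np.linspace(-90, 90, 4400)
lon = np.linspace(-180, 180, 4400)
xr.Dataset(
    {"TEMPERATURE": (["lat", "lon"], np.random.rand(4400, 4400))},
    coords={"lat": lat, "lon": lon},
).to_netcdf("medium.nc")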

abinashmk commented 2 weeks ago

The size does not seem to be causing the problem. I tried creating large (about 5 GB) and small files, and I could read and delete them without any issues. I even cropped the data for a particular region from the above file and saved it to a separate file; I could read and delete that too. I can't think of a way to reproduce the same error here.

max-sixty commented 2 weeks ago

OK so to confirm: this code fails for this specific file, but we can't find any other file where the problem occurs?

filePath=r"c_gls_LAI300-RT0_201712310000_GLOBE_PROBAV_V1.0.1.nc"
with xr.open_dataset(filePath) as ds:
        print(ds)
os.remove("c_gls_LAI300-RT0_201712310000_GLOBE_PROBAV_V1.0.1.nc")

Very surprising if so! Again, my guess is that it's too specific a problem to get traction, but we can reopen if there's a more reproducible case...