hmaarrfk opened 11 months ago
OK, this was bugging me enough, so here is a reproducer that just "runs" without needing any data:
import xarray
engine = 'netcdf4'
dataset = xarray.Dataset()
dataset.coords['x'] = ['a']
dataset.to_netcdf('mrc.nc')
dataset = xarray.open_dataset('mrc.nc', engine=engine)
for i in range(10):
    print(f"i={i}")
    xarray.open_dataset('mrc.nc', engine=engine)
The key was making the coordinate an H5T_STRING type.
Sorry to rapid fire post, but the following "hack" seems to resolve the issues I am observing:
diff --git a/xarray/backends/netCDF4_.py b/xarray/backends/netCDF4_.py
index f21f15bf..8f1243da 100644
--- a/xarray/backends/netCDF4_.py
+++ b/xarray/backends/netCDF4_.py
@@ -394,8 +394,8 @@ class NetCDF4DataStore(WritableCFDataStore):
         kwargs = dict(
             clobber=clobber, diskless=diskless, persist=persist, format=format
         )
-        manager = CachingFileManager(
-            netCDF4.Dataset, filename, mode=mode, kwargs=kwargs
+        manager = DummyFileManager(
+            netCDF4.Dataset(filename, mode=mode, **kwargs)
         )
         return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
I have a feeling some reference isn't being kept, and the file is being freed somehow during garbage collection.
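That suspected failure mode can be sketched in pure Python, with no netCDF involved at all (every class and method name below is hypothetical, invented purely for illustration): a cache hands the same underlying handle to two wrappers, and when one wrapper is garbage-collected, its cleanup closes the handle the other wrapper still uses.

```python
class Handle:
    """Stands in for an open netCDF4.Dataset."""

    def __init__(self):
        self.closed = False

    def read(self):
        if self.closed:
            # In C this would be a use-after-free, i.e. a segfault.
            raise RuntimeError("use after close")
        return "data"


class CachedWrapper:
    """Toy analogue of a caching file manager: one shared Handle per filename."""

    _cache = {}

    def __init__(self, name):
        self._handle = self._cache.setdefault(name, Handle())

    def __del__(self):
        # Over-eager cleanup: closes the shared handle on garbage collection,
        # even though another wrapper may still be using it.
        self._handle.closed = True


w1 = CachedWrapper("mrc.nc")
w2 = CachedWrapper("mrc.nc")
del w2  # CPython refcounting runs __del__ immediately, closing the shared handle
try:
    print(w1._handle.read())
except RuntimeError as exc:
    print(exc)  # use after close
```

Here the second wrapper's `__del__` poisons state the first wrapper still depends on, which is exactly the shape of "some reference isn't being kept" above.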
While this "hack" somewhat "works", if I try to open the same file with two different backends, it really likes to complain.
It may be that libnetcdf just expects to be in control of the file at all times.
Running a similar segfaulting benchmark on xarray's main branch:
import xarray
import numpy as np
write_engine = 'h5netcdf'
hold_engine = 'h5netcdf'
read_engine = 'netcdf4'
filename = f'{write_engine}_mrc.nc'
# %%
dataset = xarray.Dataset()
dataset.coords['x'] = ['a']
dataset.coords['my_version'] = '1.2.3.4.5.6'
dataset['images'] = (('x', ), np.zeros((1,)))
dataset.to_netcdf(filename, engine=write_engine)
# %%
dataset = xarray.open_dataset(filename, engine=hold_engine)
for i in range(100):
    print(f"i={i}")
    xarray.open_dataset(filename, engine=read_engine)
| hold/read engine | write engine: h5netcdf | write engine: netcdf4 |
|---|---|---|
| netcdf4/netcdf4 | pass | segfault |
| netcdf4/h5netcdf | pass | segfault |
| h5netcdf/h5netcdf | pass | pass |
| h5netcdf/netcdf4 | pass | pass |
While I know these issues are hard, can anybody else confirm that this happens on their system as well? Maybe my machine is really weird....
As a final reproducer:
import xarray
xarray.set_options(warn_for_unclosed_files=True)
# Also needs a small patch....
"""
diff --git a/xarray/backends/file_manager.py b/xarray/backends/file_manager.py
index df901f9a..a2e8af03 100644
--- a/xarray/backends/file_manager.py
+++ b/xarray/backends/file_manager.py
@@ -252,11 +252,10 @@ class CachingFileManager(FileManager):
                 self._lock.release()

             if OPTIONS["warn_for_unclosed_files"]:
-                warnings.warn(
+                print(
                     f"deallocating {self}, but file is not already closed. "
                     "This may indicate a bug.",
-                    RuntimeWarning,
-                    stacklevel=2,
+                    flush=True
                 )

     def __getstate__(self):
"""
dataset = xarray.Dataset()
dataset.coords['x'] = ['a']
dataset.to_netcdf('mrc.nc', engine='netcdf4')
dataset = xarray.open_dataset('mrc.nc', engine='netcdf4')
for i in range(100):
    print(f"i={i}")
    xarray.open_dataset('mrc.nc', engine='netcdf4')
Gives the output:
i=0
deallocating CachingFileManager(<class 'netCDF4._netCDF4.Dataset'>, '/home/mark/git/wgpu/mrc.nc', mode='r', kwargs={'clobber': True, 'diskless': False, 'persist': False, 'format': 'NETCDF4'}, manager_id='120349e5-9287-4535-a724-588aa78cf9d0'), but file is not already closed. This may indicate a bug.
I can confirm that your original reproducer segfaults on my system (Linux/x86_64). I also agree with your diagnosis that this seems to be an issue with the caching file manager.
FWIW, adding dataset.close() after the first open_dataset does seem to solve things. That said, I don't think this should be required; it suggests that, for some reason, we're not using the cached file object.
Thanks for confirming. It has been puzzling me to no end.
I'm also struggling with the problem and I have simplified the code a bit more:
import xarray
import numpy as np
import os
import sys
filename = 'test_mrc.nc'
if not os.path.exists(filename):
    dataset_w = xarray.Dataset()
    dataset_w['x'] = ['a']
    dataset_w.to_netcdf(filename)
print("try open 1", file=sys.stderr)
dataset = xarray.open_dataset(filename)
print("try open 2", file=sys.stderr)
dataset2 = xarray.open_dataset(filename)
dataset2 = None
print("try open 3", file=sys.stderr)
dataset3 = xarray.open_dataset(filename)
print("success")
The problem only occurs if certain features of netcdf4 are used in the file (e.g. superblock 2, strings), but those are common. The cache manager fails to handle the file if it was opened twice and one of the two handles goes out of scope (here dataset2). The next open (dataset3) throws a segmentation fault.
I've tested with v2023.6.0 (latest version in conda-forge / python 3.11.7, linux).
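The "closed only when the last holder releases it" discipline that this double-open scenario seems to call for can be sketched with a small reference-counted handle (a sketch only; the names are invented and this is not xarray's actual API):

```python
class RefCountedHandle:
    """Hypothetical shared handle that survives until its last user is done."""

    def __init__(self):
        self.refs = 0
        self.closed = False

    def acquire(self):
        self.refs += 1
        return self

    def release(self):
        self.refs -= 1
        if self.refs == 0:
            self.closed = True  # only the final release really closes the file


handle = RefCountedHandle()
a = handle.acquire()  # first open (the dataset we keep holding)
b = handle.acquire()  # second open, which goes out of scope first
b.release()
print(handle.closed)  # False: the held dataset still keeps it open
a.release()
print(handle.closed)  # True: the last holder closed it
```

With this discipline, dropping dataset2 would merely decrement a counter instead of invalidating the handle that dataset3's open later trips over.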
The above example succeeds in my case (after removing the extra .):
(venv) $ pip freeze | grep xarray
xarray==2024.3.0
(venv) $ python3 --version
Python 3.11.3
(venv) $ python3 test.py
try open 1
try open 2
try open 3
success
No conda; just a regular python virtual environment
I just managed to upgrade my xarray to 2024.3.0 (pinning the version) and I still get the error, though it works sometimes?
$ conda install xarray=2024.3.0
...
$ conda list | grep xarray
xarray 2024.3.0 pyhd8ed1ab_0 conda-forge
$ python3 xarray_segfault.py
try open 1
try open 2
try open 3
success
$ python3 xarray_segfault.py
try open 1
try open 2
try open 3
Segmentation fault (core dumped)
(This was independent of whether the test_mrc.nc file existed or not.)
From a quick bisect, it appears this particular issue was introduced by #4879. Not sure what to do to fix that.
We can observe, though, that the file manager id in the least-recently-used cache changes every time we open the file, but the underlying netCDF4.Dataset object always stays the same. So that might be a hint?
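That observation can be mimicked with a toy cache (purely illustrative; this is not xarray's actual data structure): each open constructs a manager with a fresh id, while the cache keeps handing back the same underlying file object, so several short-lived managers end up sharing one handle.

```python
import uuid

FILE_CACHE = {}  # toy stand-in for the backend's LRU file cache


class ToyManager:
    """Hypothetical manager: new identity per open, shared cached file."""

    def __init__(self, name):
        self.manager_id = uuid.uuid4()  # fresh id on every open
        # ...but the cached underlying "file" object is reused across opens
        self.file = FILE_CACHE.setdefault(name, object())


m1 = ToyManager("mrc.nc")
m2 = ToyManager("mrc.nc")
print(m1.manager_id != m2.manager_id)  # True: a new manager each time
print(m1.file is m2.file)              # True: the same underlying object
```

If a manager's cleanup assumes it exclusively owns its file object, this sharing would explain how one manager's deallocation can invalidate another manager's handle.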
I think it is related to https://github.com/pydata/xarray/discussions/7359#discussioncomment-4314359
What happened?
The following code yields a segfault on my machine (and many other machines with a similar environment)
Attachments: tiny.nc.txt, mrc.nc.txt
What did you expect to happen?
Not to segfault.
Minimal Complete Verifiable Example
Hand-crafting the file from start to finish seems not to segfault.
Anything else we need to know?
At first I thought the problem was deep in HDF5, but I am less convinced now.
xref: https://github.com/HDFGroup/hdf5/issues/3649