pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

netCDF IO crash #9125

Open ernimd opened 2 weeks ago

ernimd commented 2 weeks ago

What happened?

I have a use case where I compute a function in parallel with dask, and inside that function each parallel call opens and reads (never writes) the same .nc dataset to do computation based on it. Running this produces a variety of errors (see the log), or sometimes none at all. Most likely it's not designed to be used this way, or it's an actual bug in either the IO implementation or dask itself.

Because the crash is non-deterministic, you should run the MVCE in a loop until it fails: `while python crash.py; do :; done`

What did you expect to happen?

I would expect it to fail gracefully, perhaps with a warning about unsafe multi-threaded reads.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np
from dask import delayed, compute

# create dummy.nc
dummy_data = xr.Dataset(
    {
        "var": (("x", "y"), np.random.rand(10, 10)),
    }
)
dummy_data = dummy_data.chunk({"x": 5, "y": 5})  # chunking is not necessary to reproduce the crash
dummy_data.to_netcdf("dummy.nc")
del dummy_data

def add_columns(value):
    ds = xr.open_dataset(
        "dummy.nc"
    )
    value *= np.random.choice(ds["var"].values.reshape(-1))
    value *= np.random.choice(ds["var"].values.reshape(-1))
    return value

res = []
for _ in range(10):
    res.append(delayed(add_columns)(np.random.rand(1)))
the_ans = compute(*res)
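
For reference, a workaround sketch (not part of the original report): serializing all netCDF access behind a module-level lock should avoid the concurrent calls into libnetcdf that appear to trigger the crash. The lock name and the `add_columns_locked` helper are hypothetical.

```python
import threading

import numpy as np
import xarray as xr
from dask import compute, delayed

# Recreate the dummy file so this sketch runs standalone.
xr.Dataset({"var": (("x", "y"), np.random.rand(10, 10))}).to_netcdf("dummy.nc")

_nc_lock = threading.Lock()  # hypothetical lock guarding all netCDF4 calls

def add_columns_locked(value):
    # Hold the lock across the whole open/read/close cycle so that only one
    # thread at a time touches the (possibly non-thread-safe) netCDF C library.
    with _nc_lock:
        with xr.open_dataset("dummy.nc") as ds:
            flat = ds["var"].values.reshape(-1)
    return value * np.random.choice(flat) * np.random.choice(flat)

res = [delayed(add_columns_locked)(np.random.rand(1)) for _ in range(10)]
the_ans = compute(*res)  # default threaded scheduler, reads now serialized
```

This trades away the parallel reads, so it only helps when the netCDF read is a small part of each task.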

MVCE confirmation

Relevant log output

Error log 1

python3: /io/netcdf-c-4.9.2/libsrc4/nc4internal.c:426: nc4_find_nc_grp_h5: Assertion `my_h5 && my_h5->root_grp' failed.
Aborted (core dumped)

Error log 2

HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 1:
  #000: H5A.c line 1327 in H5Aiterate2(): invalid location identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5VLint.c line 1749 in H5VL_vol_object(): invalid identifier type to function
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 2:
  #000: H5G.c line 548 in H5Gget_create_plist(): not a group ID
    major: Invalid arguments to routine
    minor: Inappropriate type
Segmentation fault (core dumped)

Error log 3

Traceback (most recent call last):
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/dataset.py", line 1446, in _construct_dataarray
    variable = self._variables[name]
KeyError: 'var'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/dataset.py", line 1545, in __getitem__
    return self._construct_dataarray(key)
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/dataset.py", line 1448, in _construct_dataarray
    _, name, variable = _get_virtual_variable(self._variables, name, self.sizes)
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/dataset.py", line 207, in _get_virtual_variable
    raise KeyError(key)
KeyError: 'var'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ernie/code/corres/crash_netcdf5.py", line 29, in <module>
    the_ans = compute(*res)
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/ernie/code/corres/crash_netcdf5.py", line 21, in add_columns
    value *= np.random.choice(ds["var"].values.reshape(-1))
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/dataset.py", line 1547, in __getitem__
    raise KeyError(
KeyError: "No variable named 'var'. Variables on the dataset include []"

Error log 4

Traceback (most recent call last):
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/utils.py", line 635, in close_on_error
    yield
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 660, in open_dataset
    ds = store_entrypoint.open_dataset(
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/store.py", line 43, in open_dataset
    vars, attrs = filename_or_obj.load()
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/common.py", line 251, in load
    (_decode_variable_name(k), v) for k, v in self.get_variables().items()
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 462, in get_variables
    return FrozenDict(
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/utils.py", line 443, in FrozenDict
    return Frozen(dict(*args, **kwargs))
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 463, in <genexpr>
    (k, self.open_store_variable(k, v)) for k, v in self.ds.variables.items()
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 424, in open_store_variable
    attributes = {k: var.getncattr(k) for k in var.ncattrs()}
  File "src/netCDF4/_netCDF4.pyx", line 4585, in netCDF4._netCDF4.Variable.ncattrs
  File "src/netCDF4/_netCDF4.pyx", line 1521, in netCDF4._netCDF4._get_att_names
  File "src/netCDF4/_netCDF4.pyx", line 2034, in netCDF4._netCDF4._ensure_nc_success
AttributeError: NetCDF: Not a valid ID

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ernie/code/corres/crash_netcdf5.py", line 29, in <module>
    the_ans = compute(*res)
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/ernie/code/corres/crash_netcdf5.py", line 17, in add_columns
    ds = xr.open_dataset(
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/api.py", line 573, in open_dataset
    backend_ds = backend.open_dataset(
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 659, in open_dataset
    with close_on_error(store):
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/core/utils.py", line 637, in close_on_error
    f.close()
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/netCDF4_.py", line 578, in close
    self._manager.close(**kwargs)
  File "/home/ernie/.miniconda/envs/corres/lib/python3.10/site-packages/xarray/backends/file_manager.py", line 234, in close
    file.close()
  File "src/netCDF4/_netCDF4.pyx", line 2627, in netCDF4._netCDF4.Dataset.close
  File "src/netCDF4/_netCDF4.pyx", line 2590, in netCDF4._netCDF4.Dataset._close
  File "src/netCDF4/_netCDF4.pyx", line 2034, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Not a valid ID

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.5.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2024.3.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.0
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.11.0
Nio: None
zarr: 2.17.2
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.4.2
distributed: 2024.4.2
matplotlib: 3.8.4
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.3.1
cupy: None
pint: None
sparse: None
flox: 0.9.7
numpy_groupies: 0.11.1
setuptools: 69.5.1
pip: 24.0
conda: None
pytest: 7.4.0
mypy: None
IPython: 8.22.2
sphinx: None

keewis commented 2 weeks ago

This also appears with open_mfdataset, but seems to happen only with the threading scheduler.

In any case, the reason is that netcdf4 has not been thread-safe since netcdf4=1.6.1 (I think). This means the problem has unfortunately been around for a while and is known (see e.g. #7079), but nobody has had the time / skills / persistence to figure out exactly what causes it – race conditions are just that tricky to debug.