pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Reading netCDF with engine=scipy fails with a TypeError under certain conditions #8693

Closed · eivindjahren closed this issue 4 months ago

eivindjahren commented 8 months ago

What happened?

Saving and loading from netCDF with engine=scipy produces an unexpected TypeError on read. The file seems to be corrupted.

What did you expect to happen?

Reading works just fine.

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr

# Dataset with a zero-length "name" dimension; expand_dims then prepends
# an "index" dimension, so "values" has shape (1, 0, 1).
ds = xr.Dataset(
    {
        "values": (
            ["name", "time"],
            np.array([[]], dtype=np.float32).T,
        )
    },
    coords={"time": [1], "name": []},
).expand_dims({"index": [0]})

# Round trip through NetCDF3 via the scipy engine; the read back fails.
ds.to_netcdf("file.nc", engine="scipy")
_ = xr.open_dataset("file.nc", engine="scipy")

MVCE confirmation

Relevant log output

KeyError                                  Traceback (most recent call last)
File .../python3.11/site-packages/xarray/backends/file_manager.py:211, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    210 try:
--> 211     file = self._cache[self._key]
    212 except KeyError:

File .../python3.11/site-packages/xarray/backends/lru_cache.py:56, in LRUCache.__getitem__(self, key)
     55 with self._lock:
---> 56     value = self._cache[key]
     57     self._cache.move_to_end(key)

KeyError: [<function _open_scipy_netcdf at 0x7fe96afa9120>, ('/home/eivind/Projects/ert/file.nc',), 'r', (('mmap', None), ('version', 2)), '264ec6b3-78b3-4766-bb41-7656d6a51962']

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[1], line 18
      4 ds = (
      5     xr.Dataset(
      6         {
   (...)
     15     .expand_dims({"index": [0]})
     16 )
     17 ds.to_netcdf("file.nc", engine="scipy")
---> 18 _ = xr.open_dataset("file.nc", engine="scipy")

File .../python3.11/site-packages/xarray/backends/api.py:572, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    560 decoders = _resolve_decoders_kwargs(
    561     decode_cf,
    562     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    568     decode_coords=decode_coords,
    569 )
    571 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 572 backend_ds = backend.open_dataset(
    573     filename_or_obj,
    574     drop_variables=drop_variables,
    575     **decoders,
    576     **kwargs,
    577 )
    578 ds = _dataset_from_backend_dataset(
    579     backend_ds,
    580     filename_or_obj,
   (...)
    590     **kwargs,
    591 )
    592 return ds

File .../python3.11/site-packages/xarray/backends/scipy_.py:315, in ScipyBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, mode, format, group, mmap, lock)
    313 store_entrypoint = StoreBackendEntrypoint()
    314 with close_on_error(store):
--> 315     ds = store_entrypoint.open_dataset(
    316         store,
    317         mask_and_scale=mask_and_scale,
    318         decode_times=decode_times,
    319         concat_characters=concat_characters,
    320         decode_coords=decode_coords,
    321         drop_variables=drop_variables,
    322         use_cftime=use_cftime,
    323         decode_timedelta=decode_timedelta,
    324     )
    325 return ds

File .../python3.11/site-packages/xarray/backends/store.py:43, in StoreBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
     29 def open_dataset(  # type: ignore[override]  # allow LSP violation, not supporting **kwargs
     30     self,
     31     filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
   (...)
     39     decode_timedelta=None,
     40 ) -> Dataset:
     41     assert isinstance(filename_or_obj, AbstractDataStore)
---> 43     vars, attrs = filename_or_obj.load()
     44     encoding = filename_or_obj.get_encoding()
     46     vars, attrs, coord_names = conventions.decode_cf_variables(
     47         vars,
     48         attrs,
   (...)
     55         decode_timedelta=decode_timedelta,
     56     )

File .../python3.11/site-packages/xarray/backends/common.py:210, in AbstractDataStore.load(self)
    188 def load(self):
    189     """
    190     This loads the variables and attributes simultaneously.
    191     A centralized loading function makes it easier to create
   (...)
    207     are requested, so care should be taken to make sure its fast.
    208     """
    209     variables = FrozenDict(
--> 210         (_decode_variable_name(k), v) for k, v in self.get_variables().items()
    211     )
    212     attributes = FrozenDict(self.get_attrs())
    213     return variables, attributes

File .../python3.11/site-packages/xarray/backends/scipy_.py:181, in ScipyDataStore.get_variables(self)
    179 def get_variables(self):
    180     return FrozenDict(
--> 181         (k, self.open_store_variable(k, v)) for k, v in self.ds.variables.items()
    182     )

File .../python3.11/site-packages/xarray/backends/scipy_.py:170, in ScipyDataStore.ds(self)
    168 @property
    169 def ds(self):
--> 170     return self._manager.acquire()

File .../python3.11/site-packages/xarray/backends/file_manager.py:193, in CachingFileManager.acquire(self, needs_lock)
    178 def acquire(self, needs_lock=True):
    179     """Acquire a file object from the manager.
    180 
    181     A new file is only opened if it has expired from the
   (...)
    191         An open file object, as returned by ``opener(*args, **kwargs)``.
    192     """
--> 193     file, _ = self._acquire_with_cache_info(needs_lock)
    194     return file

File .../python3.11/site-packages/xarray/backends/file_manager.py:217, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    215     kwargs = kwargs.copy()
    216     kwargs["mode"] = self._mode
--> 217 file = self._opener(*self._args, **kwargs)
    218 if self._mode == "w":
    219     # ensure file doesn't get overridden when opened again
    220     self._mode = "a"

File .../python3.11/site-packages/xarray/backends/scipy_.py:109, in _open_scipy_netcdf(filename, mode, mmap, version)
    106     filename = io.BytesIO(filename)
    108 try:
--> 109     return scipy.io.netcdf_file(filename, mode=mode, mmap=mmap, version=version)
    110 except TypeError as e:  # netcdf3 message is obscure in this case
    111     errmsg = e.args[0]

File .../python3.11/site-packages/scipy/io/_netcdf.py:278, in netcdf_file.__init__(self, filename, mode, mmap, version, maskandscale)
    275 self._attributes = {}
    277 if mode in 'ra':
--> 278     self._read()

File .../python3.11/site-packages/scipy/io/_netcdf.py:607, in netcdf_file._read(self)
    605 self._read_dim_array()
    606 self._read_gatt_array()
--> 607 self._read_var_array()

File .../python3.11/site-packages/scipy/io/_netcdf.py:688, in netcdf_file._read_var_array(self)
    685     data = None
    686 else:  # not a record variable
    687     # Calculate size to avoid problems with vsize (above)
--> 688     a_size = reduce(mul, shape, 1) * size
    689     if self.use_mmap:
    690         data = self._mm_buf[begin_:begin_+a_size].view(dtype=dtype_)

TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'
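
For context, the failing line multiplies the product of the variable's shape by a per-item size that is evidently None when reading this file. A minimal standalone sketch of how that expression raises this TypeError (the shape and size values here are hypothetical, not scipy's actual internal state):

from functools import reduce
from operator import mul

shape = (1, 0)  # hypothetical shape containing a zero-length dimension
size = None     # hypothetical: no per-item size was recovered from the file
a_size = reduce(mul, shape, 1) * size  # TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'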

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.4 (main, Dec 7 2023, 15:43:41) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.2.0-39-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2024.1.1
pandas: 2.1.1
numpy: 1.26.1
scipy: 1.11.3
netCDF4: 1.6.5
pydap: None
h5netcdf: None
h5py: 3.10.0
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.8.0
cartopy: None
seaborn: 0.13.1
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 63.4.3
pip: 23.3.1
conda: None
pytest: 7.4.4
mypy: 1.8.0
IPython: 8.17.2
sphinx: 7.2.6

kmuehlbauer commented 8 months ago

@eivindjahren Thanks for bringing this to our attention.

From the description it's a bit unclear which engine you want/need to use. You mentioned engine=netcdf (should that be netcdf4?), and in the code example you use engine="scipy". From what I can tell, the scipy engine uses the NETCDF3 data model, which places some restrictions on the dimensions of variables: it supports only one unlimited dimension, and that dimension needs to be the first dimension of the variable.
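
For illustration, a minimal sketch of that restriction using scipy's netcdf_file directly (the file name demo.nc, the dimension names, and the values are made up for this example):

import scipy.io

f = scipy.io.netcdf_file("demo.nc", mode="w")
f.createDimension("time", None)  # the single unlimited (record) dimension
f.createDimension("x", 3)
# Record variables must have the unlimited dimension as their first dimension.
v = f.createVariable("v", "f", ("time", "x"))
v[0, :] = [1.0, 2.0, 3.0]  # writing grows the record dimension
f.close()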

If you do

$ ncdump file.nc
ncdump: file.nc: NetCDF: NC_UNLIMITED in the wrong index

But if we move the zero-length dimension to the front before saving:

ds = ds.transpose("name", "index", "time")

The resulting file isn't even recognized by ncdump:

$ ncdump file.nc
ncdump: file.nc: NetCDF: Unknown file format

Yet it can be read back perfectly fine with engine="scipy".
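
Condensed, the round trip described above (reusing ds and xr from the MCVE; this reflects the behavior I observed, not guaranteed semantics):

ds_t = ds.transpose("name", "index", "time")  # zero-length dimension first
ds_t.to_netcdf("file.nc", engine="scipy")
_ = xr.open_dataset("file.nc", engine="scipy")  # reads back, even though ncdump cannot parse the file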

I did not explore further, but something weird is going on with the scipy engine here.

eivindjahren commented 8 months ago

From the description it's a bit unclear which engine you want/need to use. You mentioned engine=netcdf (should that be netcdf4?)

Sorry, I meant engine=scipy; that was a typo. We decided to use it in our application for performance reasons.

kmuehlbauer commented 8 months ago

Sorry, I meant engine=scipy; that was a typo. We decided to use it in our application for performance reasons.

A bit off-topic now, but can you elaborate a bit on what performance benefits you get from the NETCDF3 format in your use case? What is preventing you from using the netcdf4 backend?

For the scipy backend issue, I'd appreciate it if someone with more knowledge of that part of the code could chime in here.

eivindjahren commented 8 months ago

Sorry, I meant engine=scipy; that was a typo. We decided to use it in our application for performance reasons.

A bit off-topic now, but can you elaborate a bit on what performance benefits you get from the NETCDF3 format in your use case? What is preventing you from using the netcdf4 backend?

I don't have the specifics of the benchmarks that were performed, but I will see what I can find. We plan to change to netcdf4 because we want to use datetime64.

kmuehlbauer commented 4 months ago

Closing for now. If this is still an issue please reopen with updated information. Thanks!