pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Xarray unable to read file that netCDF4 can #5164

Open WardBrian opened 3 years ago

WardBrian commented 3 years ago

What happened:

I am reading files from https://www-air.larc.nasa.gov/pub/NDACC/PUBLIC/stations/mauna.loa.hi/hdf/lidar/.

When one of these files is passed to xr.open_dataset, the following error occurs:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-36-895975874f7f> in <module>
----> 1 xr.open_dataset(
      2     "/users/bmward/groundbased_lidar.temperature_nasa.jpl002_glass.1.1_mauna.loa.hi_20200103t050130z_20200103t072420z_001.h4",
      3     engine="netcdf4",
      4 )

~/.conda/envs/bg-dev/lib/python3.9/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime, decode_timedelta)
    555 
    556     with close_on_error(store):
--> 557         ds = maybe_decode_store(store, chunks)
    558 
    559     # Ensure source filename always stored in dataset object (GH issue #2550)

~/.conda/envs/bg-dev/lib/python3.9/site-packages/xarray/backends/api.py in maybe_decode_store(store, chunks)
    451 
    452     def maybe_decode_store(store, chunks):
--> 453         ds = conventions.decode_cf(
    454             store,
    455             mask_and_scale=mask_and_scale,

~/.conda/envs/bg-dev/lib/python3.9/site-packages/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta)
    637         encoding = obj.encoding
    638     elif isinstance(obj, AbstractDataStore):
--> 639         vars, attrs = obj.load()
    640         extra_coords = set()
    641         close = obj.close

~/.conda/envs/bg-dev/lib/python3.9/site-packages/xarray/backends/common.py in load(self)
    111         """
    112         variables = FrozenDict(
--> 113             (_decode_variable_name(k), v) for k, v in self.get_variables().items()
    114         )
    115         attributes = FrozenDict(self.get_attrs())

~/.conda/envs/bg-dev/lib/python3.9/site-packages/xarray/backends/netCDF4_.py in get_variables(self)
    417 
    418     def get_variables(self):
--> 419         dsvars = FrozenDict(
    420             (k, self.open_store_variable(k, v)) for k, v in self.ds.variables.items()
    421         )

~/.conda/envs/bg-dev/lib/python3.9/site-packages/xarray/core/utils.py in FrozenDict(*args, **kwargs)
    451 
    452 def FrozenDict(*args, **kwargs) -> Frozen:
--> 453     return Frozen(dict(*args, **kwargs))
    454 
    455 

~/.conda/envs/bg-dev/lib/python3.9/site-packages/xarray/backends/netCDF4_.py in <genexpr>(.0)
    418     def get_variables(self):
    419         dsvars = FrozenDict(
--> 420             (k, self.open_store_variable(k, v)) for k, v in self.ds.variables.items()
    421         )
    422         return dsvars

~/.conda/envs/bg-dev/lib/python3.9/site-packages/xarray/backends/netCDF4_.py in open_store_variable(self, name, var)
    394         # netCDF4 specific encoding; save _FillValue for later
    395         encoding = {}
--> 396         filters = var.filters()
    397         if filters is not None:
    398             encoding.update(filters)

src/netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.filters()

src/netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

RuntimeError: NetCDF: Attempting netcdf-4 operation on netcdf-3 file

However,

import netCDF4
netCDF4.Dataset(
    "/users/bmward/groundbased_lidar.temperature_nasa.jpl002_glass.1.1_mauna.loa.hi_20200103t050130z_20200103t072420z_001.hdf",
)

does not produce any errors.

What you expected to happen:

I expect xarray to be able to load the file.

Minimal Complete Verifiable Example:

import xarray as xr
xr.open_dataset(
    "groundbased_lidar.temperature_nasa.jpl002_glass.1.1_mauna.loa.hi_20200103t050130z_20200103t072420z_001.hdf",
    engine="netcdf4",
)

Anything else we need to know?:

Changing the engine to h5netcdf produces a different error, but still fails.

Setting decode_cf=False has no effect.

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.1 | packaged by conda-forge | (default, Jan 26 2021, 01:34:10) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.11.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.17.0
pandas: 1.2.3
numpy: 1.20.1
scipy: 1.6.2
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.10.0
h5py: 3.1.0
Nio: None
zarr: None
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.03.1
distributed: 2021.03.1
matplotlib: 3.3.4
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 21.0.1
conda: None
pytest: None
IPython: 7.22.0
sphinx: 3.5.3
kmuehlbauer commented 3 years ago

@WardBrian It's not hdf5 but hdf4:

$ hdfls groundbased_lidar.aerosol_nasa.jpl002_glass.1.1_mauna.loa.hi_20040109t045500z_20040109t065531z_001.hdf
groundbased_lidar.aerosol_nasa.jpl002_glass.1.1_mauna.loa.hi_20040109t045500z_20040109t065531z_001.hdf:
File library version: Major= 4, Minor=2, Release=3
String=HDF Version 4.2 Release 3, January 27, 2008
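
The same can also be checked from Python, since netCDF4-python reports the underlying storage format of an open file. A quick sketch (file name taken from the report above; the printed values are what I'd expect for an HDF4 file, not something verified here):

import netCDF4

# Open the file and inspect its logical data model and on-disk storage format.
ds = netCDF4.Dataset(
    "groundbased_lidar.temperature_nasa.jpl002_glass.1.1_mauna.loa.hi_20200103t050130z_20200103t072420z_001.hdf"
)
print(ds.data_model)   # logical model, e.g. NETCDF3_CLASSIC
print(ds.disk_format)  # storage format, expected to be HDF4 for these files
ds.close()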

Unfortunately, I have no idea how to get this into xarray. But there's a good chance that someone knows how to do this.

WardBrian commented 3 years ago

hdf4

Yes, the only mention I can find of HDF4 support for xarray relies on PyNIO, which has been discontinued. If netCDF4 can open the raw file, I'm not sure why xarray can't.

WardBrian commented 3 years ago

I've been able to confirm locally that the problem is caused by the call to filters() here: https://github.com/pydata/xarray/blob/18ed29e4086145c29fde31c9d728a939536911c9/xarray/backends/netCDF4_.py#L395-L399

The line is even commented to say it is netCDF4-specific, but it is called unconditionally. I wrapped it in a try/except and the file then loaded, so I think this may just be an oversight in the backend. Roughly, the workaround looks like the sketch below.
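
A minimal sketch of that change in xarray/backends/netCDF4_.py (the shape of the workaround, not the exact patch I applied):

# inside NetCDF4DataStore.open_store_variable
# netCDF4 specific encoding; save _FillValue for later
encoding = {}
try:
    filters = var.filters()
except RuntimeError:
    # HDF4 / netCDF-3 style files raise
    # "NetCDF: Attempting netcdf-4 operation on netcdf-3 file" here,
    # and they have no HDF5 filters to record anyway.
    filters = None
if filters is not None:
    encoding.update(filters)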

kmuehlbauer commented 1 year ago

@WardBrian Coming back to this now: netCDF4-python can obviously read these kinds of HDF4 files. We might think about special-casing filters here so that it is not called unconditionally. Thoughts?
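
One possible shape for that special-casing (just a sketch; the exact condition is an assumption, not a tested patch) would be to only ask for filters when the file actually uses the netCDF-4/HDF5 data model:

# hypothetical special-casing in NetCDF4DataStore.open_store_variable
encoding = {}
# data_model comes from netCDF4-python, e.g. NETCDF4, NETCDF4_CLASSIC,
# or NETCDF3_CLASSIC; filters() is only meaningful for netCDF-4 (HDF5) data.
if self.ds.data_model.startswith("NETCDF4"):
    filters = var.filters()
    if filters is not None:
        encoding.update(filters)

Alternatively, the try/except from the comment above would cover the same case without relying on the data model string.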

dcherian commented 1 year ago

Seems good to me.