pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

nbytes not available for lazy loaded array and so can't print(ds) #9185

Closed TimothyCera-NOAA closed 3 months ago

TimothyCera-NOAA commented 3 months ago

What happened?

We use the grib2io backend to read GRIB2-formatted files. Starting with the v2024.02.0 release, printing the summary of the dataset to the screen fails. I suspect the problem comes from #8702.

Printing a dataset fails while trying to find nbytes.

The grib2io backend opens the file lazily, so the summary is printed for a MemoryCachedArray, which has no nbytes and cannot calculate it.

After loading the data into memory, print(ds1) works fine.

import xarray as xr
filters = {
        "productDefinitionTemplateNumber": 0,
        "typeOfFirstFixedSurface": 1,
        "shortName": "TMP",
        }
ds1 = xr.open_dataset(
        "gfs_20221107/gfs.t00z.pgrb2.1p00.f012_subset",
        engine="grib2io",
        filters=filters,
    )
print(ds1)
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 print(ds1)

File ~/anaconda3/envs/default311/lib/python3.11/site-packages/xarray/core/dataset.py:2569, in Dataset.__repr__(self)
   2568 def __repr__(self) -> str:
-> 2569     return formatting.dataset_repr(self)

File ~/anaconda3/envs/default311/lib/python3.11/reprlib.py:21, in recursive_repr.<locals>.decorating_function.<locals>.wrapper(self)
     19 repr_running.add(key)
     20 try:
---> 21     result = user_function(self)
     22 finally:
     23     repr_running.discard(key)

File ~/anaconda3/envs/default311/lib/python3.11/site-packages/xarray/core/formatting.py:717, in dataset_repr(ds)
    715 @recursive_repr("<recursive Dataset>")
    716 def dataset_repr(ds):
--> 717     nbytes_str = render_human_readable_nbytes(ds.nbytes)
    718     summary = [f"<xarray.{type(ds).__name__}> Size: {nbytes_str}"]
    720     col_width = _calculate_col_width(ds.variables)

File ~/anaconda3/envs/default311/lib/python3.11/site-packages/xarray/core/dataset.py:1544, in Dataset.nbytes(self)
   1536 @property
   1537 def nbytes(self) -> int:
   1538     """
   1539     Total bytes consumed by the data arrays of all variables in this dataset.
   1540 
   1541     If the backend array for any variable does not include ``nbytes``, estimates
   1542     the total bytes for that array based on the ``size`` and ``dtype``.
   1543     """
-> 1544     return sum(v.nbytes for v in self.variables.values())

File ~/anaconda3/envs/default311/lib/python3.11/site-packages/xarray/core/dataset.py:1544, in <genexpr>(.0)
   1536 @property
   1537 def nbytes(self) -> int:
   1538     """
   1539     Total bytes consumed by the data arrays of all variables in this dataset.
   1540 
   1541     If the backend array for any variable does not include ``nbytes``, estimates
   1542     the total bytes for that array based on the ``size`` and ``dtype``.
   1543     """
-> 1544     return sum(v.nbytes for v in self.variables.values())

File ~/anaconda3/envs/default311/lib/python3.11/site-packages/xarray/namedarray/core.py:491, in NamedArray.nbytes(self)
    489         itemsize = xp.finfo(self.dtype).bits // 8
    490 else:
--> 491     raise TypeError(
    492         "cannot compute the number of bytes (no array API nor nbytes / itemsize)"
    493     )
    495 return self.size * itemsize

TypeError: cannot compute the number of bytes (no array API nor nbytes / itemsize)

You can force loading the data and then printing works:

print(ds1["TMP"].values[0][0])
253.28014

print(ds1)
<xarray.Dataset> Size: 1MB
Dimensions:                   (y: 181, x: 360)
Coordinates:
    refDate                   datetime64[ns] 8B ...
    leadTime                  timedelta64[ns] 8B ...
    valueOfFirstFixedSurface  float64 8B ...
    latitude                  (y, x) float64 521kB ...
    longitude                 (y, x) float64 521kB ...
    validDate                 datetime64[ns] 8B ...
Dimensions without coordinates: y, x
Data variables:
    TMP                       (y, x) float32 261kB 253.3 253.3 ... 240.2 240.2
Attributes:
    engine:   grib2io

What did you expect to happen?

I expected print(ds1) to print the summary of the dataset:

<xarray.Dataset> Size: 1MB
Dimensions:                   (y: 181, x: 360)
Coordinates:
    refDate                   datetime64[ns] 8B ...
    leadTime                  timedelta64[ns] 8B ...
    valueOfFirstFixedSurface  float64 8B ...
    latitude                  (y, x) float64 521kB ...
    longitude                 (y, x) float64 521kB ...
    validDate                 datetime64[ns] 8B ...
Dimensions without coordinates: y, x
Data variables:
    TMP                       (y, x) float32 261kB 253.3 253.3 ... 240.2 240.2
Attributes:
    engine:   grib2io

Minimal Complete Verifiable Example

# Download the GRIB2 file first from:
# https://github.com/NOAA-MDL/grib2io/blob/master/tests/data/gfs_20221107/gfs.t00z.pgrb2.1p00.f012_subset
import xarray as xr
filters = {
            "productDefinitionTemplateNumber": 0,
            "typeOfFirstFixedSurface": 1,
            "shortName": "TMP",
            }
ds1 = xr.open_dataset(
            "gfs_20221107/gfs.t00z.pgrb2.1p00.f012_subset",
            engine="grib2io",
            filters=filters,
        )
print(ds1)

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:24:40) [GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-112-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2
xarray: 2024.6.0
pandas: 2.2.1
numpy: 1.26.4
scipy: 1.12.0
netCDF4: 1.6.5
pydap: None
h5netcdf: None
h5py: None
zarr: 2.17.1
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.3.1
distributed: 2024.3.1
matplotlib: 3.8.4
cartopy: 0.22.0
seaborn: None
numbagg: None
fsspec: 2024.3.1
cupy: None
pint: 0.23
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.2.0
pip: 24.0
conda: 24.3.0
pytest: 8.1.1
mypy: None
IPython: 8.22.2
sphinx: 7.3.7

max-sixty commented 3 months ago

Thanks for the issue. We should definitely wrap the nbytes computation in a try/except, given that it can fail...
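As a rough illustration of the try/except idea, here is a minimal sketch that uses no xarray code at all. The names render_size, LazyArray and LoadedArray are hypothetical stand-ins invented for this example; how xarray itself would handle the failure is not shown in this thread.

```python
def render_size(obj):
    """Return a size label, falling back to a placeholder if nbytes fails."""
    try:
        nbytes = obj.nbytes
    except TypeError:
        # The size cannot be determined, but the repr should still be printed.
        return "Size: unknown"
    return f"Size: {nbytes} bytes"


class LazyArray:
    """Hypothetical lazy array whose byte count cannot be computed."""

    @property
    def nbytes(self):
        raise TypeError("cannot compute the number of bytes")


class LoadedArray:
    """Hypothetical in-memory array with a known byte count."""

    nbytes = 260640


print(render_size(LazyArray()))
print(render_size(LoadedArray()))
```

With a guard like this, a failing size lookup degrades the summary instead of preventing it from being printed.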

keewis commented 3 months ago

As far as I can tell, the reason for this is that grib2io defines an OnDiskArray whose dtype is a string. That is unexpected, but since dtypes are opaque objects in the array API, we might have to figure out how to deal with that at some point.
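To make the string dtype problem concrete, the sketch below mimics the size * itemsize fallback visible at the end of the traceback. The function estimate_nbytes is a hypothetical simplification written for this example, not xarray's actual implementation, and it omits the array API branch.

```python
import numpy as np


def estimate_nbytes(size, dtype):
    """Simplified stand-in for the size * itemsize fallback in the traceback."""
    if not hasattr(dtype, "itemsize"):
        raise TypeError(
            "cannot compute the number of bytes (no array API nor nbytes / itemsize)"
        )
    return size * dtype.itemsize


size = 181 * 360  # y * x, taken from the dataset summary above

# A plain string such as "float32" has no itemsize attribute, so this fails.
try:
    estimate_nbytes(size, "float32")
    string_dtype_ok = True
except TypeError:
    string_dtype_ok = False

# A real dtype instance carries itemsize, so the estimate succeeds.
nbytes = estimate_nbytes(size, np.dtype("float32"))
print(string_dtype_ok, nbytes)
```

The successful estimate is 260640 bytes, consistent with the 261kB reported for TMP once the data is loaded.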

TimothyCera-NOAA commented 3 months ago

I tried to fix this in grib2io by replacing "float32" with np.float32, but that didn't help. What did work was enforcing np.dtype in xarray, as shown in #9191.

keewis commented 3 months ago

You'd have to replace it with the dtype instance, np.dtype("float32"). It looks like attribute descriptors such as itemsize return the descriptor itself instead of the result of __get__ when accessed on the class object (np.float32) rather than on an instance.
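The difference between the scalar class and the dtype instance can be checked directly with NumPy. This is only a small demonstration of the attribute lookup; the variable names are chosen for this example.

```python
import numpy as np

# Accessed on the scalar class, itemsize is a descriptor object, not a number.
class_attr = np.float32.itemsize

# Accessed on a dtype instance, itemsize is the size in bytes.
instance_attr = np.dtype("float32").itemsize

print(type(class_attr).__name__)
print(instance_attr)  # 4

# Passing the scalar class through np.dtype() also yields a usable instance.
normalized = np.dtype(np.float32)
print(normalized.itemsize)  # 4
```

So hasattr(np.float32, "itemsize") is true, yet multiplying an array size by that attribute still fails, which would explain why swapping the string for np.float32 did not help.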

TimothyCera-NOAA commented 3 months ago

This issue was supposed to be closed when I closed #9191, but it wasn't, so I'm closing it now.

As mentioned in the pull request, the comments here and in the pull request helped me track down how to fix this in grib2io.