pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Cannot store data after group_by #2847

Open volkerjaenisch opened 5 years ago

volkerjaenisch commented 5 years ago

Hi Xarray!

I really like your library, but now I am completely stuck.

Code Sample, a copy-pastable example if possible

import numpy as np
import xarray as xr

data = [1,2,3,4,5,6,7,8,9,10]
bins = np.array(range(5)) * 2
xr_data = xr.Dataset({'data': data})
out = xr_data.groupby_bins('data', bins).mean()
out.to_netcdf('/tmp/test')

Problem description

I get this error:

Traceback (most recent call last):
  File "/home/volker/workspace/pycharm-community-2018.1.2/helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/home/volker/workspace/pycharm-community-2018.1.2/helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/volker/workspace/pycharm-community-2018.1.2/helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/volker/workspace/pycharm-community-2018.1.2/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/volker/workspace/eprofile_wind/eprofile/src/eprofile/sandbox/test_xarray.py", line 12, in <module>
    out.to_netcdf('/tmp/test')
  File "/home/volker/workspace/eprofile_wind-CRxNsezQ/lib/python3.5/site-packages/xarray/core/dataset.py", line 1232, in to_netcdf
    compute=compute)
  File "/home/volker/workspace/eprofile_wind-CRxNsezQ/lib/python3.5/site-packages/xarray/backends/api.py", line 747, in to_netcdf
    unlimited_dims=unlimited_dims)
  File "/home/volker/workspace/eprofile_wind-CRxNsezQ/lib/python3.5/site-packages/xarray/backends/api.py", line 790, in dump_to_store
    unlimited_dims=unlimited_dims)
  File "/home/volker/workspace/eprofile_wind-CRxNsezQ/lib/python3.5/site-packages/xarray/backends/common.py", line 261, in store
    variables, attributes = self.encode(variables, attributes)
  File "/home/volker/workspace/eprofile_wind-CRxNsezQ/lib/python3.5/site-packages/xarray/backends/common.py", line 347, in encode
    variables, attributes = cf_encoder(variables, attributes)
  File "/home/volker/workspace/eprofile_wind-CRxNsezQ/lib/python3.5/site-packages/xarray/conventions.py", line 605, in cf_encoder
    for k, v in iteritems(variables))
  File "/home/volker/workspace/eprofile_wind-CRxNsezQ/lib/python3.5/site-packages/xarray/conventions.py", line 605, in <genexpr>
    for k, v in iteritems(variables))
  File "/home/volker/workspace/eprofile_wind-CRxNsezQ/lib/python3.5/site-packages/xarray/conventions.py", line 241, in encode_cf_variable
    var = ensure_dtype_not_object(var, name=name)
  File "/home/volker/workspace/eprofile_wind-CRxNsezQ/lib/python3.5/site-packages/xarray/conventions.py", line 201, in ensure_dtype_not_object
    data = _copy_with_dtype(data, dtype=_infer_dtype(data, name))
  File "/home/volker/workspace/eprofile_wind-CRxNsezQ/lib/python3.5/site-packages/xarray/conventions.py", line 139, in _infer_dtype
    .format(name))
ValueError: unable to infer dtype on variable 'data_bins'; xarray cannot serialize arbitrary Python objects

Expected Output

The Dataset should be written to a netCDF file.

Output of xr.show_versions()

>>> xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3 (default, Sep 27 2018, 17:25:39) [GCC 6.3.0 20170516]
python-bits: 64
OS: Linux
OS-release: 4.9.0-8-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8
libhdf5: 1.10.2
libnetcdf: 4.4.1.1
xarray: 0.11.3
pandas: 0.24.1
numpy: 1.16.1
scipy: None
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.2.1
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
cyordereddict: None

dcherian commented 5 years ago

Try data = np.array(...)

spencerkclark commented 5 years ago

Thanks for the issue. I think the main problem is that we currently do not have a way of saving an IntervalIndex, which groupby_bins produces, to a netCDF file:

In [7]: out.indexes['data_bins']
Out[7]:
IntervalIndex([(0, 2], (2, 4], (4, 6], (6, 8]],
              closed='right',
              dtype='interval[int64]')

One way to work around this in the meantime is to redefine the bins coordinate before saving things to a file. See @jhamman's answer to a related StackOverflow question for an example.
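For readers who cannot follow the link, here is a minimal sketch of that kind of workaround. It is not necessarily the approach taken in the linked answer. It assumes the binned coordinate holds pandas Interval objects and that replacing them with bin midpoints (plus the left and right edges as extra coordinates) is acceptable for your use case. The helper name replace_interval_coord, the variable temperature, and the coordinate x are made up for illustration.

```python
import numpy as np
import xarray as xr


def replace_interval_coord(ds, name):
    """Swap a coordinate of pandas.Interval objects for plain float coordinates.

    Hypothetical helper for illustration; it is not part of xarray.
    """
    intervals = ds[name].values  # object array of pandas.Interval
    left = np.array([iv.left for iv in intervals], dtype="float64")
    right = np.array([iv.right for iv in intervals], dtype="float64")
    return ds.assign_coords(
        {
            name: (name, (left + right) / 2),  # bin midpoints become the coordinate
            name + "_left": (name, left),      # keep both edges so no information is lost
            name + "_right": (name, right),
        }
    )


bins = np.arange(5) * 2  # bin edges 0, 2, 4, 6, 8
ds = xr.Dataset(
    {"temperature": ("x", np.arange(1.0, 11.0))},
    coords={"x": np.arange(1, 11)},
)
out = ds.groupby_bins("x", bins).mean()

fixed = replace_interval_coord(out, "x_bins")
print(fixed["x_bins"].dtype)  # a float dtype instead of object
# fixed.to_netcdf("/tmp/test.nc")  # needs a netCDF backend such as netCDF4 or scipy
```

The to_netcdf call is left commented out because it depends on which backend is installed. Note that the midpoints alone do not record whether the intervals were closed on the left or the right.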

volkerjaenisch commented 5 years ago

Thank you @spencerkclark for the fast response. I did exactly as you advised and it works fine. A hint in the documentation might keep others from falling into this pit. Also it would be nice to have intervals serialized into netCDF since they are quite common structures.

Cheers, Volker

fmaussion commented 5 years ago

A hint in the documentation might keep others from falling into this pit.

Agreed. Would you like to submit a pull request?

Also it would be nice to have intervals serialized into netCDF since they are quite common structures.

There are ways to deal with intervals in the CF conventions, but what we really need is a way for xarray to truly understand intervals, which is a much bigger endeavor.

rabernat commented 5 years ago

In the longer term, we could move towards supporting IntervalIndex as an in-memory representation of CF-conventions' cell bounds concept. In addition to many other benefits, this would allow us to encode and decode such indices from netCDF files.
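To make the idea concrete, here is a rough sketch of what a manual round trip through CF-style bounds could look like today. It is not an xarray feature: the helpers intervals_to_cf_bounds and cf_bounds_to_intervals, the bounds dimension name bnds, and the "_bounds" suffix are assumptions made for illustration. CF bounds store only the two edges of each cell, so the closed side of the intervals has to be supplied by the caller when decoding.

```python
import numpy as np
import pandas as pd
import xarray as xr


def intervals_to_cf_bounds(ds, name, bounds_dim="bnds"):
    """Encode an Interval coordinate as midpoints plus an (n, 2) bounds variable.

    Hypothetical helper for illustration; it is not part of xarray.
    """
    intervals = ds[name].values  # object array of pandas.Interval
    edges = np.array([[iv.left, iv.right] for iv in intervals], dtype="float64")
    bounds_name = name + "_bounds"
    # The "bounds" attribute on the coordinate points at the bounds variable.
    out = ds.assign_coords(
        {name: (name, edges.mean(axis=1), {"bounds": bounds_name})}
    )
    out[bounds_name] = ((name, bounds_dim), edges)
    return out


def cf_bounds_to_intervals(ds, name, closed="right"):
    """Rebuild a pandas IntervalIndex from the bounds variable (hypothetical helper)."""
    edges = ds[ds[name].attrs["bounds"]].values
    return pd.IntervalIndex.from_arrays(edges[:, 0], edges[:, 1], closed=closed)


ds = xr.Dataset(
    {"temperature": ("x", np.arange(1.0, 11.0))},
    coords={"x": np.arange(1, 11)},
)
out = ds.groupby_bins("x", np.arange(5) * 2).mean()

encoded = intervals_to_cf_bounds(out, "x_bins")   # contains only numeric variables
restored = cf_bounds_to_intervals(encoded, "x_bins")
print(restored)
```

Native support would mean xarray doing both directions automatically when writing and reading files, instead of the user calling helpers like these by hand.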

xref #1475

rabernat commented 5 years ago

Also xref #2844