pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

xarray v 2023.9.0: `ValueError: unable to infer dtype on variable 'time'; xarray cannot serialize arbitrary Python objects` #8653

Open jerabaul29 opened 9 months ago

jerabaul29 commented 9 months ago

What happened?

I tried to save an xarray dataset whose time dimension holds timezone-aware datetimes to a netCDF file with `to_netcdf` and got the error `ValueError: unable to infer dtype on variable 'time'; xarray cannot serialize arbitrary Python objects`.

What did you expect to happen?

I expected xarray to detect automatically that these were datetimes and to convert them to whatever format it works with internally, so that they could be written to a CF-compatible file, following what is described at https://github.com/pydata/xarray/issues/2512 .

Minimal Complete Verifiable Example

import xarray as xr
import datetime

times = [datetime.datetime(2024, 1, 1, 1, 1, 1, tzinfo=datetime.timezone.utc), datetime.datetime(2024, 1, 1, 1, 1, 2, tzinfo=datetime.timezone.utc)]

data = [1, 2]

xr_result = xr.Dataset(
    {
        'time':
        xr.DataArray(dims=["time"],
                     data=times,
                     attrs={
                         "standard_name": "time",
                     }),
        #
        'data':
        xr.DataArray(dims=["time"],
                     data=data,
                     attrs={
                         "_FillValue": "NaN",
                         "standard_name": "some_data",
                     }),
    }
)

xr_result.to_netcdf("test.nc")

Relevant log output

No response

Anything else we need to know?

The example is available as a notebook viewable at:

https://github.com/jerabaul29/public_bug_reports/blob/main/xarray/2024_01_24/xarray_and_datetimes.ipynb

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 6.5.0-14-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 2023.9.0
pandas: 2.0.3
numpy: 1.25.2
scipy: 1.11.3
netCDF4: 1.6.2
pydap: None
h5netcdf: None
h5py: 3.10.0
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.5
dask: 2023.9.2
distributed: 2023.9.2
matplotlib: 3.7.2
cartopy: 0.21.1
seaborn: 0.13.0
numbagg: None
fsspec: 2023.9.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.0.0
pip: 23.2.1
conda: None
pytest: None
mypy: None
IPython: 8.15.0
sphinx: None

kmuehlbauer commented 9 months ago

That seems to be some tricky issue with timezones.

The issue already manifests at DataArray (Variable) creation. The given list of timezone-aware datetimes is converted into pandas._libs.tslibs.timestamps.Timestamp objects (wrapped in a NumPy array of object dtype, 'O'). This happens in https://github.com/pydata/xarray/blob/c9ba2be2690564594a89eb93fb5d5c4ae7a9253c/xarray/core/variable.py#L213. From that point on, nothing converts the data to numpy datetime64[ns] or a similar dtype that could be serialized correctly.
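The object-dtype result can be reproduced outside xarray. The following is a minimal sketch, assuming that the linked code path behaves roughly like a round trip through pandas.Series followed by np.asarray; that simplification is an assumption made for illustration, not a copy of xarray's code:

```python
import datetime

import numpy as np
import pandas as pd

utc = datetime.timezone.utc
aware = [
    datetime.datetime(2024, 1, 1, 1, 1, 1, tzinfo=utc),
    datetime.datetime(2024, 1, 1, 1, 1, 2, tzinfo=utc),
]
# Same instants with the tzinfo dropped.
naive = [t.replace(tzinfo=None) for t in aware]

# Assumed simplification of the conversion: Series round trip, then asarray.
aware_arr = np.asarray(pd.Series(aware))
naive_arr = np.asarray(pd.Series(naive))

print(aware_arr.dtype, type(aware_arr[0]))  # object array of pandas Timestamps
print(naive_arr.dtype)                      # a NumPy datetime64 dtype
```

With timezone-aware input, NumPy has no dtype that can carry the time zone, so the values stay as Python-level Timestamp objects, which is what the serializer later rejects.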

The same happens if you wrap your data in a NumPy array. Serialization only works once the tzinfo is stripped, either by not adding tzinfo in the first place or by casting to a proper dtype:

import numpy as np
times = np.array(times).astype("<M8[ns]")
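The other route mentioned above, not carrying tzinfo at all, can be sketched with the standard library alone. The helper name to_naive_utc is hypothetical and not part of xarray; it normalizes each timezone-aware datetime to UTC and then drops the tzinfo:

```python
import datetime


def to_naive_utc(times):
    """Hypothetical helper: return naive datetimes expressed in UTC."""
    result = []
    for t in times:
        if t.tzinfo is not None:
            # Shift to UTC first so the instant is preserved, then drop tzinfo.
            t = t.astimezone(datetime.timezone.utc).replace(tzinfo=None)
        result.append(t)
    return result


# Two aware datetimes in different zones, one second apart in absolute time.
utc_plus_one = datetime.timezone(datetime.timedelta(hours=1))
times = [
    datetime.datetime(2024, 1, 1, 1, 1, 1, tzinfo=datetime.timezone.utc),
    datetime.datetime(2024, 1, 1, 2, 1, 2, tzinfo=utc_plus_one),
]

naive_times = to_naive_utc(times)
print(naive_times)
```

Passing naive_times as data= to xr.DataArray should then follow the path that works for datetimes created without tzinfo. The time zone itself is lost, so the values have to be treated as UTC by convention.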

I'm not versed in that special part of DataArray/Variable creation with timezone aware datetimes and how to properly solve that issue. Hoping that others have more insight here, @spencerkclark?