pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.59k stars 1.08k forks source link

Requested time units written by to_netcdf() are only honoured if values are a dask array #9268

Open LukeJones123 opened 2 months ago

LukeJones123 commented 2 months ago

What happened?

I want to write a dataset containing a time variable to netCDF. I specify the units I want in the variable encoding, but the precise string I specify is only written if the time values are a dask array. Otherwise, a "cleaned up" version is written in which the part of the string is removed.

What did you expect to happen?

The time units string written shouldn't depend on whether the time values are a dask array or not.

The "clean up" of the units string is done in _eagerly_encode_cf_datetime but is ignored (only) for dask arrays in _lazily_encode_cf_datetime. They should both behave the same.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np
import dask.array as da
import subprocess

def write_file(bug_switch):

    # step is the dimension, and there is one time per step
    steps = [0]
    times = [np.datetime64('2004-07-01', 'ns') + s
             for s in steps]

    # WE ONLY GET THE CORRECT TIME UNITS IF THE TIMES ARE A DASK ARRAY
    if bug_switch == 'off':
        times = da.from_array(times)

    ds = xr.Dataset(
        data_vars={'valid_time': (['step'], times)},
        coords={'step': steps}
    )

    ds.valid_time.encoding = {
        'units': 'seconds since 1970-01-01T00:00:00',
        'dtype': np.dtype('float64')}

    path = f'bug_{bug_switch}.nc'
    ds.to_netcdf(path)

    # Show the result
    subprocess.run(['ncdump', '-h', path])

for bug_switch in ['on', 'off']:
    write_file(bug_switch)

MVCE confirmation

Relevant log output

netcdf bug_on {
dimensions:
    step = 1 ;
variables:
    double valid_time(step) ;
        valid_time:_FillValue = NaN ;
        valid_time:units = "seconds since 1970-01-01" ;
        valid_time:calendar = "proleptic_gregorian" ;
    int64 step(step) ;
}
netcdf bug_off {
dimensions:
    step = 1 ;
variables:
    double valid_time(step) ;
        valid_time:_FillValue = NaN ;
        valid_time:units = "seconds since 1970-01-01T00:00:00" ;
        valid_time:calendar = "proleptic_gregorian" ;
    int64 step(step) ;
}

Anything else we need to know?

This is related to, but not the same as, #1449 and #8137. A solution to those issues would likely also be a solution to this one. I see no reason for the cleanup in _eagerly_encode_cf_datetime persisting through to the output file, so I think these strings should be written as they are.

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.11.8 (main, Mar 6 2024, 11:51:07) [GCC 8.5.0 20210514 (Red Hat 8.5.0-15)] python-bits: 64 OS: Linux OS-release: 4.18.0-477.43.1.el8_8.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_GB.UTF-8 LOCALE: ('en_GB', 'UTF-8') libhdf5: 1.14.2 libnetcdf: 4.9.3-development xarray: 2024.6.0 pandas: 2.2.2 numpy: 2.0.0 scipy: None netCDF4: 1.7.1.post1 pydap: None h5netcdf: 1.3.0 h5py: 3.11.0 zarr: None cftime: 1.6.4 nc_time_axis: None iris: None bottleneck: None dask: 2024.7.0 distributed: None matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: 2024.6.1 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 65.5.0 pip: 24.0 conda: None pytest: None mypy: None IPython: None sphinx: None
kmuehlbauer commented 3 weeks ago

@LukeJones123 Sorry for the delay here. It seems that both unit-strings are valid ISO 8601 datetime strings and refer to the same time. So technically this is no bug, but nevertheless we should enforce similar behaviour.

LukeJones123 commented 3 weeks ago

Thanks. It would be very useful for us because in our organisation we try to supply stable and consistent netCDF to external users (for whom stability and consistency are very important) and it currently means the time units string changes under certain circumstances.