Requested time units written by to_netcdf() are only honoured if values are a dask array

LukeJones123 commented 4 months ago

What happened?

I want to write a dataset containing a time variable to netCDF. I specify the units I want in the variable encoding, but the precise string I specify is only written if the time values are a dask array. Otherwise, a "cleaned up" version is written in which the part of the string is removed.

What did you expect to happen?

The time units string written shouldn't depend on whether the time values are a dask array or not.

The "clean up" of the units string is done in _eagerly_encode_cf_datetime but is ignored (only) for dask arrays in _lazily_encode_cf_datetime. They should both behave the same.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np
import dask.array as da
import subprocess

def write_file(bug_switch):

    # step is the dimension, and there is one time per step
    steps = [0]
    times = [np.datetime64('2004-07-01', 'ns') + s
             for s in steps]

    # WE ONLY GET THE CORRECT TIME UNITS IF THE TIMES ARE A DASK ARRAY
    if bug_switch == 'off':
        times = da.from_array(times)

    ds = xr.Dataset(
        data_vars={'valid_time': (['step'], times)},
        coords={'step': steps}
    )

    ds.valid_time.encoding = {
        'units': 'seconds since 1970-01-01T00:00:00',
        'dtype': np.dtype('float64')}

    path = f'bug_{bug_switch}.nc'
    ds.to_netcdf(path)

    # Show the result
    subprocess.run(['ncdump', '-h', path])

for bug_switch in ['on', 'off']:
    write_file(bug_switch)

MVCE confirmation

[X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
[X] Complete example — the example is self-contained, including all data and the text of any traceback.
[X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
[X] New issue — a search of GitHub Issues suggests this is not a duplicate.
[X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

netcdf bug_on {
dimensions:
    step = 1 ;
variables:
    double valid_time(step) ;
        valid_time:_FillValue = NaN ;
        valid_time:units = "seconds since 1970-01-01" ;
        valid_time:calendar = "proleptic_gregorian" ;
    int64 step(step) ;
}
netcdf bug_off {
dimensions:
    step = 1 ;
variables:
    double valid_time(step) ;
        valid_time:_FillValue = NaN ;
        valid_time:units = "seconds since 1970-01-01T00:00:00" ;
        valid_time:calendar = "proleptic_gregorian" ;
    int64 step(step) ;
}

Anything else we need to know?

This is related to, but not the same as, #1449 and #8137. A solution to those issues would likely also be a solution to this one. I see no reason for the cleanup in _eagerly_encode_cf_datetime persisting through to the output file, so I think these strings should be written as they are.

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.11.8 (main, Mar 6 2024, 11:51:07) [GCC 8.5.0 20210514 (Red Hat 8.5.0-15)] python-bits: 64 OS: Linux OS-release: 4.18.0-477.43.1.el8_8.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_GB.UTF-8 LOCALE: ('en_GB', 'UTF-8') libhdf5: 1.14.2 libnetcdf: 4.9.3-development xarray: 2024.6.0 pandas: 2.2.2 numpy: 2.0.0 scipy: None netCDF4: 1.7.1.post1 pydap: None h5netcdf: 1.3.0 h5py: 3.11.0 zarr: None cftime: 1.6.4 nc_time_axis: None iris: None bottleneck: None dask: 2024.7.0 distributed: None matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: 2024.6.1 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 65.5.0 pip: 24.0 conda: None pytest: None mypy: None IPython: None sphinx: None

kmuehlbauer commented 2 months ago

@LukeJones123 Sorry for the delay here. It seems that both unit-strings are valid ISO 8601 datetime strings and refer to the same time. So technically this is no bug, but nevertheless we should enforce similar behaviour.

LukeJones123 commented 2 months ago

Thanks. It would be very useful for us because in our organisation we try to supply stable and consistent netCDF to external users (for whom stability and consistency are very important) and it currently means the time units string changes under certain circumstances.

pydata / xarray