pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Default time encoding of nanoseconds is NOT good. #9154

Open ChrisBarker-NOAA opened 1 week ago

ChrisBarker-NOAA commented 1 week ago

What happened?

When you call to_netcdf() on a Dataset without an encoding for the time variable set, you get: "nanoseconds since ..."

This is not good, as nanoseconds aren't supported by a lot of downstream software, e.g. cftime, etc. (I'm not sure if it's CF compliant.)

I see that this was changed somewhere after #3942

It is an improvement over integer days, that's for sure, but not the best option.

I understand that the internal pandas datetime64 type is ns -- and this preserves any precision that may be there, but:

"practicality beats purity"

And this is not practical. Frankly, I wonder why pandas chose nanoseconds as the default, but for the common xarray (and netCDF) use cases, ns is completely unneeded precision.

I'd suggest integer milliseconds, or, frankly, even seconds.

Alternatively, use, say, floating point hours, which would buy you a lot of precision for small values -- I can't imagine who is using nanoseconds and also cares about years of timespan.
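(For reference, the on-disk units can already be forced per variable via the encoding argument -- a rough sketch, where the variable name "time" and the output path are just placeholders:)

```python
import numpy as np
import pandas as pd
import xarray as xr

# Sketch: explicitly request a coarser on-disk time encoding instead of
# relying on the default. Variable names and the output path are placeholders.
ds = xr.Dataset(
    {"temp": ("time", np.random.rand(4))},
    coords={"time": pd.date_range("2024-01-01", periods=4)},
)
ds.to_netcdf(
    "out.nc",
    encoding={"time": {"units": "milliseconds since 1970-01-01", "dtype": "int64"}},
)
```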

What did you expect to happen?

No response

Minimal Complete Verifiable Example

No response

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

No response

Environment

welcome[bot] commented 1 week ago

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

spencerkclark commented 1 week ago

@ChrisBarker-NOAA could you provide a minimal working example? While there are select circumstances where xarray will do this, I do not think this should be the case in general.

kmuehlbauer commented 1 week ago

@ChrisBarker-NOAA In general xarray tries to encode times at the coarsest possible resolution.

So if the wanted resolution is, say, days and the data can't be represented exactly like that, then the next finer fitting resolution is used (e.g. hours).

This also works if no resolution is given: xarray tries to find the coarsest resolution in which all the data can be represented.
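For illustration, a small sketch of that inference (the file name here is arbitrary, and the exact inferred units string can differ between versions):

```python
import pandas as pd
import xarray as xr

# Times falling on whole hours: the inferred on-disk units should be hours.
times = pd.date_range("2000-01-01", periods=4, freq="6h")
ds = xr.Dataset(coords={"time": times})
ds.to_netcdf("times.nc")

with xr.open_dataset("times.nc", decode_times=False) as raw:
    print(raw["time"].attrs["units"])  # e.g. "hours since 2000-01-01"
```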

There might have been changes in these parts of the code, or other changes in behaviour due to dependency updates. I second @spencerkclark here: an MCVE would help to debug this.

spencerkclark commented 1 week ago

To add a little more context, @ChrisBarker-NOAA you are right to point out that my comment (https://github.com/pydata/xarray/issues/3942#issuecomment-966264200) ignores considerations such as downstream (non-xarray or even non-Python) use-cases and (possibly) CF compliance, which I think may have been additional (perhaps more fundamental) reasons that xarray did not support encoding units finer than seconds previously (https://github.com/pydata/xarray/issues/4183#issuecomment-651404825). If I were to write it again I would probably not have been as absolute.

As @kmuehlbauer notes, for NumPy arrays of times we still try to use the coarsest possible resolution that enables exact encoding with int64 values. To me this represents a reasonable compromise. For users whose dates do not need nanosecond precision, times continue to be encoded with coarser units, but for users whose dates do need it, times are encoded with finer units such that they can be exactly round-tripped to / from disk (https://github.com/pydata/xarray/issues/4045, https://github.com/pydata/xarray/pull/4684).

For chunked arrays—and maybe this is what you are running into—there was a recent change to use "nanoseconds since 1970-01-01" as the default units if no encoding is specified (https://github.com/pydata/xarray/pull/8575). The rationale there was that to infer the coarsest possible units that would enable an exact round trip would otherwise require a full compute (enabling lazy encoding was the primary premise of that PR). Even though this is an advanced use-case, I understand that this could be surprising and/or inconvenient depending on the application. Does the time data you are writing to netCDF happen to be chunked?
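For concreteness, a rough sketch of that chunked case (requires dask; the variable name is made up, and whether this shows nanosecond units depends on the xarray version):

```python
import dask.array as da
import numpy as np
import xarray as xr

# A chunked datetime64 variable written without an explicit time encoding.
times = np.array(["2000-01-01", "2000-01-02", "2000-01-03"], dtype="datetime64[ns]")
ds = xr.Dataset({"event_time": ("record", da.from_array(times, chunks=1))})
ds.to_netcdf("chunked_times.nc")

with xr.open_dataset("chunked_times.nc", decode_times=False) as raw:
    print(raw["event_time"].attrs["units"])  # may be "nanoseconds since 1970-01-01"
```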

ChrisBarker-NOAA commented 1 week ago

@spencerkclark: yup, that's probably it.

We've been bitten by this a fair bit lately -- always(?) when loading a Dataset from some kind of remote dataset: OPeNDAP, Zarr (via kerchunked data on AMS), or MFDataset, etc.

Then the time encoding gets lost, or a new time array is created for one reason or another, and we end up with the nanosecond-based encoding.

So yes, chunked.

Another issue is that it's all too common for folks to use a float datatype in the time variable, which means the precision you actually get is essentially arbitrary -- it depends on the float type and the magnitude of the values. (Would you even get nanoseconds? Not sure, I haven't done the math.)
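(Back-of-the-envelope sketch of what, say, float32 gives you there:)

```python
import numpy as np

# Spacing between adjacent float32 values near "seconds since 1970" for a
# present-day date: roughly two minutes, so float32 seconds can't even
# resolve individual minutes, let alone nanoseconds.
print(np.spacing(np.float32(1.7e9)))  # ~128.0 seconds
```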

So what to do?

1) Looking at the past issues, I see folks did want milliseconds, but has anyone asked for nanoseconds? (I know -- if it works, folks could be using them without us knowing it.) I do think that defaulting to seconds was probably too coarse. So we could simply change the default to milliseconds, and I think that would solve the problem at hand.

2) Why do we get datetime64[ns] by default when decoding a netCDF file? I'm guessing that's inherited from pandas, and maybe isn't easy to change, but if it could be changed, we could use [us] or even [s] (or some kind of smart decoding) instead -- and then we're good to go.

-CHB

ChrisBarker-NOAA commented 1 week ago

Hmm -- if this is right (from a random unrelated issue on some other project):

"Since 2.0.0 [pandas] actually supports timestamp precisions other than nanoseconds"

Looks like it:

https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#construction-with-datetime64-or-timedelta64-dtype-with-unsupported-resolution

So maybe there's the solution :-)
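(A quick sketch of that new behavior -- on pandas < 2.0 this would be coerced to nanoseconds, or raise for out-of-range dates:)

```python
import numpy as np
import pandas as pd

# On pandas >= 2.0 the second resolution is preserved; note this date is
# outside the range representable by datetime64[ns].
s = pd.Series(np.array(["2500-01-01"], dtype="datetime64[s]"))
print(s.dtype)  # datetime64[s]
```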

kmuehlbauer commented 1 week ago

@ChrisBarker-NOAA One reason to use datetime64[ns] internally is that we can keep the int64 representation in memory (and on disk) also in the presence of NaT. np.datetime64('nat') and the netcdf default fillvalue for int64 are the same -> -9223372036854775808 (see netcdf.h).

In case of encoded _FillValue/missing_value we can use int64 throughout instead of having to convert to float (to represent the nan values). The latter has led to a bunch of issues in the past, where ns precision was lost. See #7817 and #7872.

Please also see https://github.com/pydata/xarray/issues/7493 where the new pandas non-nanosecond datetimes are discussed.

I did not look deeper into this new behaviour, but it should be possible to make xarray aware of non-nanosecond datetime resolution. I fear this will be a major undertaking, though.

ChrisBarker-NOAA commented 1 week ago

One reason to use datetime64[ns] internally is that we can keep the int64 representation in memory (and on disk) also in the presence of NaT. np.datetime64('nat') and the netcdf default fillvalue for int64 are the same -> -9223372036854775808

Which is a good reason to use datetime64 (and 64-bit integers in general) -- but numpy's datetime64 lets you choose whatever "unit" you want, so datetime64[ms] or datetime64[s] satisfies these same criteria.
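(Quick sketch: the NaT sentinel is the same int64 value at every resolution:)

```python
import numpy as np

# NaT's integer payload is the minimum int64, regardless of the unit.
for unit in ("s", "ms", "us", "ns"):
    print(unit, np.datetime64("NaT", unit).astype("int64"))  # -9223372036854775808
print(np.iinfo(np.int64).min)                                # -9223372036854775808
```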

The latter has lead to a bunch of issues in the past, where ns precision was lost. See https://github.com/pydata/xarray/issues/7817 and https://github.com/pydata/xarray/issues/7872.

Very good reasons not to convert to float -- so yes, but I'm not suggesting that.

Frankly, numpy's datetime64 was not all that thoroughly reviewed before it was implemented, and has been trouble ever since :-( -- but that's water under the bridge.

As for nanoseconds: I suspect that was chosen by pandas originally, because "why not" -- i.e. lots of extra bits to use.

But whether xarray continues to use ns internally is another question -- has ANYONE ever asked for ns?

But a breaking change is a breaking change, which is not good.

I did not look deeper into this new behaviour, but it should be possible to make xarray aware of non-nanosecond datetime resolution. I fear this will be a major undertaking, though.

An easier option would probably be to go to datetime64[ms] instead, though somewhere out there is someone actually using nanoseconds who might complain :-(

But it may not be THAT major an undertaking -- grepping the code, there are indeed a lot of references to datetime64[ns], but when you remove the tests and docstrings -- not all that many.

So: maybe a lot to change, but mostly boilerplate :-(

The first question is -- would this be a welcome change? If so, maybe I can take a look ...

Option 1) Change [ns] to [ms] as the static xarray datetime resolution.

Option 2) Make the internal datetime resolution settable.

So:

Option (2) is probably the way to go -- let it be settable -- and then we can decide where to change the default -- e.g. I would suggest using [s], or maybe [ms], as the default for netCDF decoding.

How good do we think the test coverage is?

Finally, option (3): just change the default behavior for encoding netCDF (or other similar formats, e.g. Zarr) -- essentially any time you are doing a "units since datetime" encoding, use milliseconds.

That would be relatively easy, and I suspect would break very little -- and anyone who does need nanoseconds could specify that.

spencerkclark commented 6 days ago

@spencerkclark: yup, that's probably it.

Thanks @ChrisBarker-NOAA, that's good to know. For this issue, my first thought was something along the lines of your option (3) for chunked arrays. The idea would be to switch to using default units of "seconds since 1970-01-01" or "milliseconds since 1970-01-01" in that context and continue using a default dtype of int64. If finer precision were required to encode the times for a faithful round trip, we would raise an error, recommending explicitly specifying the units encoding. While this would lead to a slightly worse user experience for those who might want / need nanosecond encoding units in the chunked array context[^1][^2], it would be a better user experience for those who do not, at least in the context of downstream uses.
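Roughly, the kind of representability check I have in mind -- a hypothetical helper, not existing xarray API:

```python
import numpy as np

def fits_units(times: np.ndarray, unit: str = "s") -> bool:
    """Hypothetical helper: True if every non-NaT value is exactly
    representable as an int64 count of `unit` since the epoch."""
    ints = times.astype("datetime64[ns]").astype("int64")
    step = np.timedelta64(1, unit).astype("timedelta64[ns]").astype("int64")
    not_nat = ints != np.iinfo(np.int64).min  # NaT encodes as int64 min
    return bool(np.all(ints[not_nat] % step == 0))

# If this returned False for the chosen default (e.g. seconds), encoding could
# raise and recommend an explicit "units" entry in the variable's encoding.
```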

Indeed as @kmuehlbauer notes, we are aware of the new flexibility in pandas. I agree that it would be a nontrivial amount of work to enable that flexibility in xarray. There probably is a lot of boilerplate, but I think there are a few things we'll need to think carefully about too. It is likely best to separate that from this issue.

[^1]: We have had some users ask for nanosecond-units decoding before (#4183), and it is important for faithfully roundtripping some times (#4045 (comment)). It is admittedly rare, however.

[^2]: pandas-dev/pandas#7307 (comment) provides some historical context for why nanosecond precision was initially chosen in pandas.