pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.54k stars 1.06k forks

Interoperability with Pandas 2.0 non-nanosecond datetime #7493

Open khider opened 1 year ago

khider commented 1 year ago

Is your feature request related to a problem?

As mentioned in this post on the Pangeo discourse, Pandas 2.0 will fully support non-nanosecond datetimes as indices. The motivation for this work came from the paleogeosciences, a community that needs to represent time scales of millions of years. Another major motivation is to facilitate paleodata-model comparison. Enter xarray!
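For context on why nanosecond precision is so limiting here, a 64-bit nanosecond timestamp can only span roughly 584 years:

```python
import pandas as pd

# With nanosecond precision, a signed 64-bit integer covers only ~584 years,
# far too narrow for paleoclimate time axes spanning millions of years.
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```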

Below is a snippet of code that creates a Pandas Series with a non-nanosecond datetime index and exports it to xarray (this works). However, most of the interesting functionality of xarray doesn't seem to support this datetime out of the box:

import numpy as np
import pandas as pd
import xarray as xr

# Second-precision datetime64 can represent negative years, which
# nanosecond precision cannot.
pds = pd.Series([10, 12, 11, 9], index=np.array(['-2000-01-01', '-2005-01-01', '-2008-01-01', '-2009-01-01']).astype('M8[s]'))
xra = pds.to_xarray()
xra.plot()  # raises a matplotlib error
xra.sel(index='-2009-01-01', method='nearest')

To test, you will need the Pandas nightly build:

pip uninstall pandas -y
pip install --pre --extra-index-url https://pypi.anaconda.org/scipy-wheels-nightly/simple "pandas>1.9"

Describe the solution you'd like

Work towards integrating the new datetimes with xarray, which would support users beyond the paleoclimate community.

Describe alternatives you've considered

No response

Additional context

No response

TomNicholas commented 1 year ago

Hi @khider , thanks for raising this.

For those of us who haven't tried to use non-nanosecond datetimes before (e.g. me), could you possibly expand a bit more on

However, most of the interesting functionality of xarray doesn't seem to support this datetime out of the box:

specifically, where are errors being thrown from within xarray? And what functions are you referring to as examples?

keewis commented 1 year ago

We are casting everything back to datetime64[ns] when creating xarray objects, so, for example, the only way to even get a non-nanosecond datetime variable is (or was, we might have fixed that?) through the zarr backend (though that would / might fail elsewhere).

@spencerkclark knows much more about this, but in any case we're aware of the change and are working on it (see e.g. #7441). (To be fair, though, at the moment it is mostly Spencer who's working on it, and he seems to be pretty preoccupied.)
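The casting keewis describes can be seen at the NumPy level; a minimal illustrative sketch of the conversion (not xarray's actual internal code):

```python
import numpy as np

# Any non-nanosecond datetime64 input gets converted to nanosecond
# precision on the way into an xarray object; the cast itself is just:
values = np.array(['2000-01-01', '2005-06-15'], dtype='datetime64[s]')
as_ns = values.astype('datetime64[ns]')
print(as_ns.dtype)  # datetime64[ns]
```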

spencerkclark commented 1 year ago

Thanks for posting this general issue @khider. This is something that has been on my radar for several months, and I agree it would be great to support (eventually this will likely help cftime support as well).

I might hesitate to say that I'm actively working on it yet 😬. Right now, in the time I have available, I'm mostly trying to make sure that xarray's existing functionality does not break under pandas 2.0. Once things are a little more stable in pandas with regard to this new feature my plan is to take a deeper dive into what it will take to adopt in xarray (some aspects might need to be handled delicately). We can plan on using this issue for more discussion.

As @keewis notes, xarray currently will cast any non-nanosecond precision datetime64 or timedelta64 values that are introduced to nanosecond-precision versions. This casting machinery goes through pandas, however, and I haven't looked carefully into how this is behaving/is expected to behave under pandas 2.0. @khider based on your nice example it seems that it is possible for non-nanosecond-precision values to slip through, which is something we may need to think about addressing for the time being.
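To illustrate the pandas-dependent behavior described above, the dtype a Series preserves differs by pandas version, so the output of this quick check is version-dependent:

```python
import numpy as np
import pandas as pd

# Under pandas < 2.0 this Series is silently upcast to datetime64[ns];
# under pandas >= 2.0 the second precision is preserved.
s = pd.Series(np.array(['2000-01-01'], dtype='datetime64[s]'))
print(pd.__version__, s.dtype)
```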

khider commented 1 year ago

Hi all,

Thank you for looking into this. I was very excited when the array was created from my non-nanosecond datetime index, but I couldn't do much manipulation beyond creation.

spencerkclark commented 1 year ago

Indeed, it would be nice if this "just worked," but it may take some time to sort out (sorry that this example initially got your hopes up!). What I mean by "address" here is continuing to prevent non-nanosecond-precision datetime values from entering xarray, by casting to nanosecond precision and raising an informative error when that is not possible. This would of course be temporary until we work through the kinks of enabling such support. In the big picture it is exciting that pandas is doing this, in part due to your grant.

dcherian commented 1 year ago

@khider It would be helpful if either you or someone on your team tried to make it work and opened a PR. That would give us a sense of what's needed and might speed it along. It would be an advanced change, but we'd be happy to provide feedback.

Adding expected-fail tests would be particularly helpful!

spencerkclark commented 1 year ago

@dcherian +1. I'm happy to engage with others if they are motivated to start on this earlier.

khider commented 1 year ago

I might need some help with the xarray codebase. I use it quite often but never had to dig into its guts.

TomNicholas commented 1 year ago

@khider we are more than happy to help with digging into the codebase! A reasonable place to start would be just trying the operation you want to perform, and looking through the code for the functions any errors get thrown from.

You are also welcome to join our bi-weekly community meetings (there is one tomorrow morning!) or the office hours we run.

spencerkclark commented 1 year ago

I can block out time to join today's meeting or an upcoming one if it would be helpful.

khider commented 1 year ago

I can attend it too. 8:30am PST, correct?

spencerkclark commented 1 year ago

Great -- I'll plan on joining. That's correct. It is at 8:30 AM PT (https://github.com/pydata/xarray/issues/4001).

spencerkclark commented 1 year ago

Thanks for joining the meeting today @khider. Some potentially relevant places in the code that come to my mind are:

Though as @shoyer says, searching for datetime64[ns] or timedelta64[ns] will probably go a long way toward finding most of these issues.
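Following @shoyer's suggestion, a search like the following (run from a local xarray checkout; the `xarray/` path is illustrative) will surface most hard-coded nanosecond dtypes:

```shell
# List every Python source line that hard-codes a nanosecond dtype;
# '|| true' keeps the command from failing if nothing matches.
grep -rn --include='*.py' -E 'datetime64\[ns\]|timedelta64\[ns\]' xarray/ || true
```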

Some design questions that come to my mind are (but you don't need an answer to these immediately to start working):

khider commented 1 year ago

Thank you!

The second point that you raise is what we are concerned about right now as well. So maybe it would be good to try to resolve it. How do you deal with PMIP simulations in terms of calendar?

spencerkclark commented 1 year ago

Currently in xarray we make the choice based on the calendar attribute associated with the data on disk (following the CF conventions). If the data has a non-standard calendar (or cannot be represented with nanosecond-precision datetime values) then we use cftime; otherwise we use NumPy. Which kind of calendar do PMIP simulations typically use?

For some background -- my initial need in this realm came mainly from idealized climate model simulations (e.g. configured to start on 0001-01-01 with a no-leap calendar), so I do not have a ton of experience with paleoclimate research. I would be happy to learn more about your application, however!
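The decision rule described above can be sketched roughly as follows (a simplification for illustration; `use_cftime` is a hypothetical helper, not xarray's actual API):

```python
# CF calendars that NumPy's proleptic-Gregorian datetime64 can represent.
STANDARD_CALENDARS = {"standard", "gregorian", "proleptic_gregorian"}

def use_cftime(calendar: str, fits_in_ns_range: bool) -> bool:
    # Fall back to cftime objects for non-standard CF calendars, or when
    # the dates cannot be represented with nanosecond-precision datetime64.
    return calendar.lower() not in STANDARD_CALENDARS or not fits_in_ns_range

print(use_cftime("noleap", True))     # True: non-standard calendar
print(use_cftime("standard", False))  # True: out of the ns range
print(use_cftime("standard", True))   # False: NumPy datetimes suffice
```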

mjwillson commented 6 months ago

Hi all, I just ran into a really nasty-to-track-down bug in xarray (version 2023.08.0, apologies if this is fixed since) where non-nanosecond datetimes are creeping in via expand_dims. Look at the difference between expand_dims and assign_coords:

In [33]: xarray.Dataset().expand_dims({'foo': [np.datetime64('2018-01-01')]})
Out[33]: 
<xarray.Dataset>
Dimensions:  (foo: 1)
Coordinates:
  * foo      (foo) datetime64[s] 2018-01-01
Data variables:
    *empty*

In [34]: xarray.Dataset().assign_coords({'foo': [np.datetime64('2018-01-01')]})
third_party/py/xarray/core/utils.py:1211: UserWarning: Converting non-nanosecond precision datetime values to nanosecond precision. This behavior can eventually be relaxed in xarray, as it is an artifact from pandas which is now beginning to support non-nanosecond precision values. This warning is caused by passing non-nanosecond np.datetime64 or np.timedelta64 values to the DataArray or Variable constructor; it can be silenced by converting the values to nanosecond precision ahead of time.
Out[34]: 
<xarray.Dataset>
Dimensions:  (foo: 1)
Coordinates:
  * foo      (foo) datetime64[ns] 2018-01-01
Data variables:
    *empty*

It seems that, for the time being, xarray depends on datetime64[ns] being used everywhere for correct behaviour -- I've seen some very weird, silent data corruption happen when the wrong datetime64 types are used accidentally due to this bug. So it would be good to be consistent about always enforcing datetime64[ns] for as long as this is the case.
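One defensive workaround in the meantime (a sketch, on the caller's side) is to normalize coordinate values to nanosecond precision before handing them to xarray:

```python
import numpy as np

# np.datetime64('2018-01-01') has day precision; cast explicitly to ns so
# that expand_dims and assign_coords see the same dtype.
coord = np.array([np.datetime64("2018-01-01")], dtype="datetime64[ns]")
print(coord.dtype)  # datetime64[ns]
```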

spencerkclark commented 6 months ago

Agreed, many thanks for the report @mjwillson—we'll have to track down why this slips through in the case of expand_dims.

spencerkclark commented 6 months ago

@mjwillson I think I tracked down the cause of the expand_dims issue—see #8782 for a fix.