pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

xr.DataSet.from_dataframe / xr.DataArray.from_series does not preserve DateTimeIndex with timezone #3291

Open fjanoos opened 5 years ago

fjanoos commented 5 years ago

Problem Description

When using Dataset.from_dataframe (or DataArray.from_series) to convert a pandas DataFrame whose DatetimeIndex has a timezone, xarray converts the datetimes into an object-dtype index of integer nanoseconds rather than keeping a datetime index type.

MCVE Code Sample

print( df.index ) 
DatetimeIndex(['2000-01-03 16:00:00-05:00', '2000-01-03 16:00:00-05:00',
               '2000-01-03 16:00:00-05:00', '2000-01-03 16:00:00-05:00',
               ...
               '2019-08-20 16:00:00-05:00', '2019-08-20 16:00:00-05:00'],
              dtype='datetime64[ns, EST]', name='time', length=12713014, freq=None)
ds = xr.Dataset.from_dataframe( df.head( 1000 )  ) 
print( ds['time'] )
<xarray.DataArray 'time' (time: 7)>
array([946933200000000000, 947019600000000000, 947106000000000000,
       947192400000000000, 947278800000000000, 947538000000000000,
       947624400000000000, ...], dtype=object)
Coordinates:
  * time     (time) object 946933200000000000 ... 947624400000000000
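
For a self-contained reproduction, a small frame can stand in for the original `df`. This is only a sketch: the column name `value`, the seven daily timestamps, and the fixed UTC-05:00 offset are assumptions chosen to resemble the output above, and the dtype that gets printed depends on the installed xarray and pandas versions.

```python
from datetime import timedelta, timezone

import numpy as np
import pandas as pd
import xarray as xr

# Assumed stand-in for the reporter's data: seven daily timestamps at 16:00
# with a fixed UTC-05:00 offset (the offset shown in the original index).
est = timezone(timedelta(hours=-5))
index = pd.date_range("2000-01-03 16:00", periods=7, freq="D", tz=est, name="time")
df = pd.DataFrame({"value": np.arange(7.0)}, index=index)

# The pandas side carries a timezone-aware dtype.
print(df.index.dtype)

# On affected versions the resulting coordinate is no longer datetime64.
ds = xr.Dataset.from_dataframe(df)
print(ds["time"].dtype)
```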

Expected Output

After removing the timezone from the DataFrame's DatetimeIndex, the conversion to a Dataset preserves the time index (without converting it to nanoseconds).

df.index = df.index.tz_convert('UTC').tz_localize(None)
ds = xr.Dataset.from_dataframe( df.head(1000) ) 
print( ds['time'] )
<xarray.DataArray 'time' (time: 7)>
array(['2000-01-03T21:00:00.000000000', '2000-01-04T21:00:00.000000000',
       '2000-01-05T21:00:00.000000000', '2000-01-06T21:00:00.000000000',
       '2000-01-07T21:00:00.000000000', '2000-01-10T21:00:00.000000000',
       '2000-01-11T21:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-03T21:00:00 ... 2000-01-11T21:00:00
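
The same workaround can be wrapped in a small helper so the original frame is left untouched. This is a sketch under assumptions: `to_naive_utc` is a hypothetical name, not an xarray or pandas function, and the sample data (column `value`, fixed UTC-05:00 offset) is invented for illustration.

```python
from datetime import timedelta, timezone

import numpy as np
import pandas as pd
import xarray as xr


def to_naive_utc(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose DatetimeIndex is timezone-naive UTC (hypothetical helper)."""
    out = frame.copy()
    if isinstance(out.index, pd.DatetimeIndex) and out.index.tz is not None:
        # Convert to UTC first so no information is lost, then drop the timezone.
        out.index = out.index.tz_convert("UTC").tz_localize(None)
    return out


# Assumed example data with a fixed UTC-05:00 offset.
est = timezone(timedelta(hours=-5))
index = pd.date_range("2000-01-03 16:00", periods=7, freq="D", tz=est, name="time")
df = pd.DataFrame({"value": np.arange(7.0)}, index=index)

ds = xr.Dataset.from_dataframe(to_naive_utc(df))
print(ds["time"].dtype)
```

Note that the timezone itself is discarded by this approach, so it has to be tracked separately if local time is needed later.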

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.9.0-9-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: None
xarray: 0.12.3+81.g41fecd86
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 1.1.4
distributed: 1.26.0
matplotlib: 3.0.3
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 40.8.0
pip: 19.0.3
conda: 4.7.11
pytest: 4.3.1
IPython: 7.4.0
sphinx: 1.8.5

shoyer commented 5 years ago

You should be getting a warning about this if you use the latest version of pandas. In the future, this behavior will change to return an object dtype array full of pandas Timestamp objects. Unfortunately, NumPy doesn't have a built-in datetime dtype with time-zone support, so this is about the best we can do.
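
To illustrate the two NumPy representations available for a timezone-aware index, here is a pandas-only sketch. The sample index and the fixed UTC-05:00 offset are assumptions for illustration; `Index.to_numpy` is the documented pandas method.

```python
from datetime import timedelta, timezone

import pandas as pd

# Assumed sample data with a fixed UTC-05:00 offset.
est = timezone(timedelta(hours=-5))
index = pd.date_range("2000-01-03 16:00", periods=3, freq="D", tz=est)

# Default conversion: object array of pandas.Timestamp, timezone kept.
as_objects = index.to_numpy()

# Explicit datetime64 conversion: values normalized to UTC, timezone dropped.
as_naive = index.to_numpy(dtype="datetime64[ns]")

print(as_objects.dtype, type(as_objects[0]).__name__)
print(as_naive.dtype, as_naive[0])
```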

scottyhq commented 3 years ago

Just wanted to rekindle discussion here and ping @dcherian and @benbovy. The current workaround for a pandas DatetimeIndex with timezone info (dtype='datetime64[ns, EST]') is to drop the timezone piece, or to use to_index(), operate in pandas, and then reassign the time coordinate. See https://github.com/pydata/xarray/issues/1036 and https://github.com/pydata/xarray/issues/3163.

If I'm following https://github.com/pydata/xarray/blob/master/design_notes/flexible_indexes_notes.md correctly, this is another potential example of improved user-friendliness: could we have timezone-aware indexes and therefore call pandas methods like pandas.core.indexes.datetimes.DatetimeIndex.tz_convert() directly as a DataArray method?

This would definitely be great for remote sensing data that is usually stored with UTC timestamps, but often analysis requires converting to local time.
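
The to_index() workaround described above might look roughly like this for the UTC-to-local-time case. It is a sketch under assumptions: the dataset, the variable name `value`, and the fixed UTC-05:00 offset are invented for illustration, and the result stores local wall-clock time as a timezone-naive coordinate because the timezone-aware dtype is not preserved.

```python
from datetime import timedelta, timezone

import numpy as np
import pandas as pd
import xarray as xr

# Assumed example data: timestamps stored as timezone-naive UTC.
time = pd.date_range("2000-01-03 21:00", periods=3, freq="D", name="time")
ds = xr.Dataset({"value": ("time", np.arange(3.0))}, coords={"time": time})

# Assumed target timezone, expressed as a fixed offset.
local_tz = timezone(timedelta(hours=-5))

# Drop into pandas, where timezone-aware operations are available.
utc_index = ds["time"].to_index().tz_localize("UTC")
local_index = utc_index.tz_convert(local_tz)

# Reassign the coordinate as timezone-naive local wall-clock time.
local_ds = ds.assign_coords(time=("time", local_index.tz_localize(None)))
print(local_ds["time"].values[0])
```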

dcherian commented 3 years ago

I am confused on the following point after reading the indexing refactor design notes on removing IndexVariable.

If ds["time"] is a 1D indexed coordinate, is ds["time"].data ≡ ds.indexes["time"].data? If so, that would just be a pd.DatetimeIndex which is timezone-aware and then this problem is solved because we don't maintain a separate numpy array. Am I understanding this correctly?

shoyer commented 3 years ago

> If ds["time"] is a 1D indexed coordinate, is ds["time"].data ≡ ds.indexes["time"].data? If so, that would just be a pd.DatetimeIndex which is timezone-aware and then this problem is solved because we don't maintain a separate numpy array. Am I understanding this correctly?

No, unfortunately it is not possible to use a pandas.Index directly inside Variable.data, because pandas.Index is not compatible with the NumPy array API; in particular, it is stuck with 1D data. Instead, we will need to wrap the array in some adapter class to make it compatible. Ideally this wrapper would be a fully N-dimensional wrapper for pandas.Series objects, but for a first pass it would probably be fine to raise an error if indexing would create a higher dimensional array.
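
A rough sketch of what such a first-pass adapter could look like follows. The class name `IndexAdapter` and everything in it are hypothetical and do not describe xarray's actual implementation; the sketch only shows a 1D wrapper that keeps the pandas dtype and refuses indexing that would add a dimension.

```python
from datetime import timedelta, timezone

import numpy as np
import pandas as pd


class IndexAdapter:
    """Hypothetical 1D array-like wrapper around a pandas.Index."""

    def __init__(self, index: pd.Index):
        self._index = index

    @property
    def dtype(self):
        # May be a pandas extension dtype rather than a numpy.dtype.
        return self._index.dtype

    @property
    def shape(self):
        return (len(self._index),)

    @property
    def ndim(self):
        return 1

    def __len__(self):
        return len(self._index)

    def __getitem__(self, key):
        if isinstance(key, tuple):
            if len(key) != 1:
                raise IndexError("only 1D indexing is supported")
            (key,) = key
        if key is None or np.ndim(key) > 1:
            raise IndexError("indexing would create a higher dimensional array")
        result = self._index[key]
        # Slices and integer arrays give an Index; integers give a scalar.
        return IndexAdapter(result) if isinstance(result, pd.Index) else result

    def __array__(self, dtype=None, copy=None):
        # Fallback for NumPy consumers; timezone-aware data loses its dtype here.
        return np.asarray(self._index, dtype=dtype)


# Assumed sample data with a fixed UTC-05:00 offset.
est = timezone(timedelta(hours=-5))
index = pd.date_range("2000-01-03 16:00", periods=4, freq="D", tz=est)
wrapped = IndexAdapter(index)

print(wrapped[1:3].dtype)
try:
    wrapped[:, None]
except IndexError as err:
    print("rejected:", err)
```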

The bigger issue is that other parts of Xarray probably need updates to avoid assuming that all dtype objects are numpy.dtype instances.
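
The dtype assumption can be seen with pandas and NumPy alone. In this small sketch, `is_datetime_like` is a hypothetical helper name; `pd.DatetimeTZDtype` and `pd.api.types.is_datetime64_any_dtype` are existing pandas APIs.

```python
import numpy as np
import pandas as pd

aware = pd.date_range("2000-01-03 16:00", periods=3, freq="D", tz="UTC")
naive = aware.tz_localize(None)

print(isinstance(naive.dtype, np.dtype))            # True
print(isinstance(aware.dtype, np.dtype))            # False
print(isinstance(aware.dtype, pd.DatetimeTZDtype))  # True


def is_datetime_like(dtype) -> bool:
    """Hypothetical check that accepts both NumPy and pandas datetime dtypes."""
    return pd.api.types.is_datetime64_any_dtype(dtype)


print(is_datetime_like(naive.dtype), is_datetime_like(aware.dtype))
```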