pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Datetimelike Array Refactor #23185

Closed TomAugspurger closed 5 years ago

TomAugspurger commented 6 years ago

A master issue, to help keep track of things.

High-level, I think we have two things to sort out

  1. Design our datetimelike Arrays (DatetimeArray, TimedeltaArray, PeriodArray)
  2. A plan for getting from master to an implementation of that design.

1. Design

We have a few things to sort out

a. Composition vs. inheritance of Index / Series and array classes
b. ...
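To make the composition-versus-inheritance question concrete, here is a minimal sketch of the two shapes. The class names (ToyArray, InheritingIndex, ComposedIndex) are hypothetical and are not pandas classes; the sketch only illustrates the trade-off, not the eventual design.

```python
import numpy as np


class ToyArray:
    """Hypothetical stand-in for an array class such as DatetimeArray."""

    def __init__(self, values):
        self._ndarray = np.asarray(values, dtype="M8[ns]")

    def __len__(self):
        return len(self._ndarray)

    @property
    def year(self):
        # Array-level logic: years since the epoch, shifted to calendar years.
        return self._ndarray.astype("M8[Y]").astype("int64") + 1970


class InheritingIndex(ToyArray):
    """Inheritance: the index *is* an array and picks up its methods directly."""

    def __init__(self, values, name=None):
        super().__init__(values)
        self.name = name


class ComposedIndex:
    """Composition: the index *holds* an array and forwards to it explicitly."""

    def __init__(self, values, name=None):
        self._data = ToyArray(values)
        self.name = name

    def __len__(self):
        return len(self._data)

    @property
    def year(self):
        return self._data.year


dates = ["2000-01-01", "2001-06-01"]
inherited = InheritingIndex(dates)
composed = ComposedIndex(dates)
print(list(inherited.year), list(composed.year))
```

With composition every forwarded method has to be written out (or generated), but the index and the array no longer share a class hierarchy.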

2. Implementation Plan

A few options

a. Big PR implementing one or more arrays, followed by smaller cleanup PRs
b. Incremental PRs (implementing most of the interface on the *ArrayMixin classes), followed by a final PR making the switch
c. ...

Project board with the relevant discussions / PRs: https://github.com/pandas-dev/pandas/projects/4

jbrockmendel commented 5 years ago

Thoughts on how we should proceed for the next few steps?

Am I right in thinking that after #23643 the remaining pieces of the EA interface are just _from_sequence and optionally _values_for_argsort, argsort, and _reduce? _from_sequence I'm working on (#23675, #23702), and those other methods are orthogonal so could be worked on in parallel.
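For readers less familiar with the interface pieces named above, the following is a rough, duck-typed sketch of what each method is responsible for. ToyDatetimeArray is hypothetical and deliberately does not subclass the pandas ExtensionArray; the method names mirror those mentioned in the comment, but the bodies are illustrative assumptions, not the pandas implementations.

```python
import numpy as np


class ToyDatetimeArray:
    """Hypothetical array; method names mirror the EA methods discussed above."""

    def __init__(self, values):
        self._ndarray = values  # expected: ndarray with dtype M8[ns]

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        # Build the array from a sequence of scalars (here: ISO strings).
        # dtype and copy are accepted but ignored in this sketch.
        return cls(np.array(scalars, dtype="M8[ns]"))

    def _values_for_argsort(self):
        # An ndarray whose sort order matches the array's; here the int64 view.
        return self._ndarray.view("i8")

    def argsort(self, ascending=True):
        order = np.argsort(self._values_for_argsort(), kind="stable")
        return order if ascending else order[::-1]

    def _reduce(self, name, skipna=True):
        # Dispatch a named reduction; NaT handling (skipna) is omitted here.
        if name not in {"min", "max"}:
            raise TypeError(f"cannot perform {name} with this array")
        return getattr(self._ndarray, name)()


arr = ToyDatetimeArray._from_sequence(["2001-01-01", "2000-01-01", "2002-01-01"])
print(arr.argsort(), arr._reduce("max"))
```

Since _values_for_argsort, argsort, and _reduce only read the stored values, they can be developed independently of _from_sequence, which is the sense in which the comment calls them orthogonal.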

For the transition to composition, if someone wants to work on it before the EA interface is complete, the approach you're taking in the disown branch looks reasonable. I'm also seeing a lot of things in that branch that could be implemented immediately before the Arrays are disowned.

If somehow we find ourselves with excess labor to throw at related tasks, some orthogonal-ish topics:

@TomAugspurger is this helpful or too much of a grab-bag?

TomAugspurger commented 5 years ago

the remaining pieces of the EA interface are just _from_sequence and optionally _values_for_argsort, argsort, and _reduce? _from_sequence I'm working on (#23675, #23702), and those other methods are orthogonal so could be worked on in parallel.

My main concern there is that until we inherit from ExtensionArray, those will be untested (or we'll have to duplicate tests). Unless... I suppose we could start inheriting the base tests, without actually inheriting from ExtensionArray yet? That may be worth exploring. Some things like pd.isna will fail hard until we inherit, but others should work. We could also put temporary checks in things like is_extension_array_dtype to also recognize DatetimeArray and TimedeltaArray.

I suspect the transpose bug will be fixed by inheriting from ExtensionArray.

jorisvandenbossche commented 5 years ago

@TomAugspurger If you currently have the time to work further on that branch to try to switch to composition, I would say: let's push for it now and first (and thus temporarily hold on for other changes). I should have time on Friday to thoroughly review it (and also a bit over the weekend); today/tomorrow I am at PyParis.

I already said that before, but the sooner we can actually switch to composition, the clearer the follow-up PRs will be (e.g. the _concat_same_type was already wrong in the PR, possibly because it was simply not exercised in our concat-related tests: as long as it is not an actual extension array, it is not used).

Also, all the datetime-arithmetic related issues can in principle be done after the split. As long as arithmetic works for Index/Series (for which it is already tested), it's fine (of course, before releasing we should also fix and test all arithmetic on the arrays as well, but it is not necessary to do that first).

jbrockmendel commented 5 years ago

let's push for it now and first (and thus temporarily hold on for other changes).

@jorisvandenbossche I would really appreciate it if you didn't advocate shutting down all progress on things that I'm putting a lot of time and effort into.

TomAugspurger commented 5 years ago

That's not how I read Joris' comment. I read "temporary hold" as... just that. A pause, not a shutting down or throwing away. I think all the effort in open PRs (and possibly some unpushed work you've done) is still vital.

There are many paths from master (plus the open PRs) to DatetimeArray. To me, a path that frontloads the switch to composition makes sense, but it's hard to say ahead of time. I've had trouble thinking through all the ramifications of a diff, partly because the current class hierarchy "feels weird" to me, and partly because I'm not familiar with this section of the code base.


Anyway, I think that https://github.com/pandas-dev/pandas/pull/23675 and https://github.com/pandas-dev/pandas/pull/23642/ are the next to go in. @jbrockmendel do you have other WIP branches changing the indexes or arrays datetimelike files that you'd like to push?

On my own availability: I'm going to be ramping up on a largish project for dask in the next few days / weeks. I'll still have time for pandas, but not as much over the next month or so. So I have a slight window to dump all my time into pandas, which I'd like to take advantage of if possible.

jorisvandenbossche commented 5 years ago

@jbrockmendel yes, sorry if that is the way it came across. I certainly think we can do parallel work, and not all the items you listed in https://github.com/pandas-dev/pandas/issues/23185#issuecomment-435592816 would go in such a split PR. It was just about what to merge first, as it will probably be easier to rebase the smaller PRs than the other way around. And also, if we do that push, there will be a lot of reviewing work :)

jorisvandenbossche commented 5 years ago

And also, it will depend on the PR of course. If it is something that doesn't overlap a lot (like certain test changes), it certainly doesn't need to wait to be merged.

TomAugspurger commented 5 years ago

Have we had a design discussion on DatetimeDtype? A short proposal:

Rename to DatetimeDtype. We can keep both the unit and tz arguments (in case we ever support resolutions other than ns in the future), but we continue to raise on non-ns unit.

class DatetimeDtype(ExtensionDtype):
    def __new__(cls, unit='ns', tz: Optional[Union[str, tzinfo]] = None):
        ...

We remove the "magic" creation from string.

In [3]: pd.core.dtypes.dtypes.DatetimeTZDtype('datetime64[ns, utc]')
Out[3]: datetime64[ns, utc]

that would throw an error, since 'datetime64[ns, utc]' isn't a valid unit.
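For comparison, a small sketch of building the timezone-aware dtype from explicit arguments versus from a string, using the DatetimeTZDtype name that pandas exposes at the top level. This assumes a reasonably recent pandas; it does not show the proposed DatetimeDtype, which is only a proposal at this point in the thread.

```python
import pandas as pd

# Explicit arguments: the unit and the timezone are passed separately.
explicit = pd.DatetimeTZDtype(unit="ns", tz="UTC")

# construct_from_string is the classmethod dedicated to parsing dtype strings.
parsed = pd.DatetimeTZDtype.construct_from_string("datetime64[ns, UTC]")

print(explicit, parsed, explicit == parsed)
```

Keeping string parsing in a separate classmethod is one way to get the effect the proposal describes: the constructor only ever sees a unit and a tz.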


One question though... I'm worried about changing the .dtype exposed to users via Series.dtype and DatetimeIndex.dtype

In [3]: pd.Series(pd.date_range('2000', periods=4)).dtype
Out[3]: dtype('<M8[ns]')

I'm not sure what the ramifications of changing that to always be DatetimeDtype would be. Of course, when a timezone is involved we'll need the dtype to be a DatetimeDtype, but should we use the numpy type when it suffices?
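A short sketch of the distinction being discussed, assuming a pandas version in which tz-naive datetime Series report a NumPy dtype and tz-aware ones report DatetimeTZDtype:

```python
import numpy as np
import pandas as pd

naive = pd.Series(pd.date_range("2000", periods=4))
aware = pd.Series(pd.date_range("2000", periods=4, tz="UTC"))

# tz-naive data exposes a plain NumPy dtype to users ...
print(type(naive.dtype), isinstance(naive.dtype, np.dtype))

# ... while tz-aware data needs a pandas dtype to carry the timezone.
print(type(aware.dtype), isinstance(aware.dtype, pd.DatetimeTZDtype))
```

User code that does something like `series.dtype == np.dtype("M8[ns]")` relies on the first behavior, which is why changing it is a compatibility concern.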

jorisvandenbossche commented 5 years ago

One question though... I'm worried about changing the .dtype exposed to users via Series.dtype and DatetimeIndex.dtype

Yes. What I thought before about this is that we would need to have an if/else logic there, to still return the numpy dtype if there is no tz. I would put this logic on the Series/Index, and have the array always return the extension dtype.

However, there might be places where we check whether the dtype of values coming from a Series/Index is an ExtensionDtype (e.g. to take another code path for extension arrays). So the above might be problematic for such code paths?

TomAugspurger commented 5 years ago

I would put this logic on the Series/Index, and have the array always return the extension dtype.

Agreed that's the right place to do it. I'll see if any of your concerns come up (I suspect something will).

jorisvandenbossche commented 5 years ago

I'll see if any of your concerns come up (I suspect something will).

I would expect it in many of the places where we use is_extension_array_dtype and pass it the actual dtype rather than the array or container.

TomAugspurger commented 5 years ago

Yeah. We could change those to is_extension_array_dtype(self._values.dtype) where necessary.
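The arrangement discussed in the last few comments can be sketched with toy classes. Everything here is hypothetical (ToyDatetimeDtype, ToyDatetimeArray, ToySeries, is_toy_extension_dtype); the last function merely stands in for is_extension_array_dtype to show why the check would need the array's dtype rather than the container's.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass(frozen=True)
class ToyDatetimeDtype:
    """Hypothetical extension dtype standing in for the proposed DatetimeDtype."""

    unit: str = "ns"
    tz: Optional[str] = None


class ToyDatetimeArray:
    def __init__(self, values, tz=None):
        self._ndarray = np.asarray(values, dtype="M8[ns]")
        self.tz = tz

    @property
    def dtype(self):
        # The array always reports the extension dtype.
        return ToyDatetimeDtype(tz=self.tz)


class ToySeries:
    def __init__(self, values, tz=None):
        self._values = ToyDatetimeArray(values, tz=tz)

    @property
    def dtype(self):
        # The container keeps returning the NumPy dtype when there is no tz.
        if self._values.tz is None:
            return self._values._ndarray.dtype
        return self._values.dtype


def is_toy_extension_dtype(dtype):
    # Hypothetical stand-in for is_extension_array_dtype.
    return isinstance(dtype, ToyDatetimeDtype)


s = ToySeries(["2000-01-01"])
print(is_toy_extension_dtype(s.dtype))          # False
print(is_toy_extension_dtype(s._values.dtype))  # True
```

A check written against `s.dtype` silently takes the non-extension code path for tz-naive data, which is the failure mode raised above.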

TomAugspurger commented 5 years ago

Small status update here: I played with moving to composition a bit last week. The basic idea was

The biggest challenge was our is_datetime* functions. They were breaking in a lot of places and in strange ways when passed a DatetimeDtype rather than an np.dtype.
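As an illustration of how the public dtype predicates treat the two kinds of dtype differently, here is a small sketch using functions from pandas.api.types. It uses DatetimeTZDtype as the extension dtype, since the proposed DatetimeDtype does not exist under that name; the behavior shown is that of recent pandas releases, not necessarily of the branch being described.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_datetime64_dtype

np_dtype = np.dtype("M8[ns]")
tz_dtype = pd.DatetimeTZDtype(unit="ns", tz="UTC")

for dtype in (np_dtype, tz_dtype):
    # is_datetime64_dtype only accepts the NumPy flavor;
    # is_datetime64_any_dtype accepts both.
    print(dtype, is_datetime64_dtype(dtype), is_datetime64_any_dtype(dtype))
```

Any internal code that branches on the narrower predicate changes behavior as soon as a tz-naive array starts reporting an extension dtype.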

Today, I've experimented with a new branch that changes the data model of DatetimeArray slightly. DatetimeArray.dtype is now a Union of np.dtype or DatetimeDtype. We'll use DatetimeDtype (or keep the name as DatetimeTZDtype) when there's a timezone, and we'll use np.dtype('M8[ns]') otherwise. This should result in a much smaller diff. I suspect that we can later clean up the dtypes so that DatetimeArray.dtype is always a DatetimeDtype, but I think that need not block the release.
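The "Union of np.dtype or DatetimeTZDtype" data model can be observed through the public `.array` accessor in recent pandas releases. This is a sketch of the resulting behavior, not of the branch itself:

```python
import numpy as np
import pandas as pd

naive = pd.date_range("2000", periods=2).array
aware = pd.date_range("2000", periods=2, tz="UTC").array

# Both are the same array class; only the dtype flavor differs.
print(type(naive).__name__, type(naive.dtype))
print(type(aware).__name__, type(aware.dtype))
```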

I'll push something up by the end of the day.

jbrockmendel commented 5 years ago

@TomAugspurger thanks for the update, and for handling the tricky dtype stuff. Aside from review, is there anything the rest of us can do to be helpful?

On my end, I'm preparing to push a branch that fixes the last of the arithmetic tests (mainly with DateOffset) for DTA/TDA (without this, these arithmetic ops would fail on the Index classes after the switch to composition). #23675 needs some edits+rebase, is otherwise close to the finish line, will put DatetimeArray._from_sequence within reach.

Added a bunch of Issues to the "DatetimeArray Refactor" Project, most of them non-blockers, e.g. reduction methods we can get around to eventually.

TomAugspurger commented 5 years ago

Still just grinding away at the inheritance -> composition move. Mostly just moving around methods / adding wrappers in small places.

I haven't really touched internals yet. I'm not sure when the best time to do that would be. For DatetimeArray, we can actually push that discussion off till after we switch things, since we already have two blocks. I'll post again when I have a better-formed opinion here.

TomAugspurger commented 5 years ago

I've isolated (one of?) the segfaults to DatetimeArray.__new__ calling conversion.ensure_datetime64ns.

This segfaults on my branch

diff --git a/pandas/tests/groupby/test_apply.py b/pandas/tests/groupby/test_apply.py
index 3bc5e51ca..e64bdc9ea 100644
--- a/pandas/tests/groupby/test_apply.py
+++ b/pandas/tests/groupby/test_apply.py
@@ -6,6 +6,13 @@ from pandas.util import testing as tm
 from pandas import DataFrame, MultiIndex, compat, Series, bdate_range, Index

+def test_apply_tz():
+    df = pd.DataFrame({'a': [1, 3, 3, 4]},
+                      index=pd.DatetimeIndex(['2000', '2000', '2001', '2001']))
+    gr = df.groupby(df.index.date)
+    gr.apply(lambda x: x.idxmax())
+
+

But passes when we don't call ensure_datetime64ns

diff --git a/pandas/core/arrays/datetimes.py b/pandas/core/arrays/datetimes.py
index 65f6d6859..612e48792 100644
--- a/pandas/core/arrays/datetimes.py
+++ b/pandas/core/arrays/datetimes.py
@@ -258,7 +258,7 @@ class DatetimeArrayMixin(dtl.DatetimeLikeArrayMixin):

         assert isinstance(values, np.ndarray), type(values)
         assert is_datetime64_dtype(values)  # not yet assured nanosecond
-        values = conversion.ensure_datetime64ns(values, copy=False)
+        # values = conversion.ensure_datetime64ns(values, copy=False)

         result = cls._simple_new(values, freq=freq, tz=tz)
         if freq_infer:

I haven't figured out the actual cause yet.
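For anyone who wants to run the failing case outside the test suite, the body of the test from the first diff can be lifted into a standalone script. This is only a restatement of that test with imports added; on a released pandas it is expected to run without crashing, and nothing here reproduces the segfault itself, which was specific to the branch.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 3, 3, 4]},
    index=pd.DatetimeIndex(["2000", "2000", "2001", "2001"]),
)

# Group by the calendar date of each index entry (two groups here).
grouped = df.groupby(df.index.date)

# For each group, find the index label of the maximum in every column.
result = grouped.apply(lambda x: x.idxmax())
print(result)
```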

TomAugspurger commented 5 years ago

Alrighty, we're close now https://github.com/TomAugspurger/pandas/tree/disown-tz-only

Right now this diff is at

89 files changed, 1859 insertions(+), 906 deletions(-)

and I have 100 xfails / skips. I'm going to spend the rest of today splitting off independent pieces, cleaning things up, and organizing the history a bit, before making a PR tonight or tomorrow.

jbrockmendel commented 5 years ago

Excellent, looking forward to taking a look. Are the skips segfault-free?

With a little luck many of the xfails will be fixed by implementing the remaining methods on DTA/TDA, most of which are (hopefully) near merging.

TomAugspurger commented 5 years ago

They are indeed segfault free. There's still a subtle failure involving a groupby resample coming from us doing bad stuff in Cython. We somehow manage to create a DatetimeIndex where DatetimeIndex._values is an ndarray, rather than a DatetimeArray. This causes an exception, but not a segfault.

jbrockmendel commented 5 years ago

A thought on a way forward, seeing as how Tom has earned some down-time.

With _index_data implemented, I'm finding the approach used in the previously-segfaulting branch (tentatively) working. Define on DTI/TDI/PI respectively:

@property
def _eadata(self):
    return DatetimeArray._simple_new(self._data, freq=self.freq, tz=self.tz)

@property
def _eadata(self):
    return TimedeltaArray._simple_new(self._data, freq=self.freq)

@property
def _eadata(self):
    return self._data

Then do the entire inheritance/composition switchover, but dispatching to self._eadata instead of self._data. (Several steps later we'll remove _eadata and dispatch to _data).

This limits the diff to the index classes without changing their outward-facing behavior, making for a much more manageable scope.
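The dispatch pattern described here can be sketched with toy classes. ToyArray and ToyIndex are hypothetical, and normalize is just an arbitrary example method; only the names _data, _eadata, and _simple_new are taken from the comment.

```python
import numpy as np


class ToyArray:
    """Hypothetical stand-in for DatetimeArray."""

    def __init__(self, values, freq=None):
        self._ndarray = values
        self.freq = freq

    @classmethod
    def _simple_new(cls, values, freq=None):
        return cls(values, freq=freq)

    def normalize(self):
        # Example of array-level logic: floor every value to midnight.
        return self._ndarray.astype("M8[D]").astype("M8[ns]")


class ToyIndex:
    """Hypothetical index that still stores a plain ndarray in _data."""

    def __init__(self, values, freq=None):
        self._data = np.asarray(values, dtype="M8[ns]")
        self.freq = freq

    @property
    def _eadata(self):
        # Wrap the ndarray on demand; _data itself is unchanged for now.
        return ToyArray._simple_new(self._data, freq=self.freq)

    def normalize(self):
        # Dispatch to the array, then re-wrap the result in an index.
        return type(self)(self._eadata.normalize(), freq=self.freq)


idx = ToyIndex(["2000-01-01T12:30"])
print(idx.normalize()._data)
```

Because `_data` stays an ndarray throughout, code that reads `_data` directly keeps working while methods are moved over one at a time.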

Thoughts?

TomAugspurger commented 5 years ago

My vote is for getting #24024 in sooner rather than later, but I'm the most familiar with the diff so it's easier for me to go through the entire thing at once. It's blocking several changes I'd like to wrap up, and my time for pandas is limited.

jbrockmendel commented 5 years ago

My vote is for getting #24024 in sooner rather than later

AFAICT the sticking points are:

I have no strong opinion on which approach to take for the astype question. For the rest, I think the best route to "sooner rather than later" is to merge #24394