Closed TomAugspurger closed 5 years ago
Thoughts on how we should proceed for the next few steps?
Am I right in thinking that after #23643 the remaining pieces of the EA interface are just _from_sequence and optionally _values_for_argsort, argsort, and _reduce? _from_sequence I'm working on (#23675, #23702), and those other methods are orthogonal so could be worked on in parallel.
For the transition to composition, if someone wants to work on it before the EA interface is complete, the approach you're taking in the disown branch looks reasonable. I'm also seeing a lot of things in that branch that could be implemented immediately before the Arrays are disowned.
If somehow we find ourselves with excess labor to throw at related tasks, some orthogonal-ish topics:
box fixture with box_with_period, box_with_datetime, and box_with_timedelta. Once pd.array (or a temporary kludge in e.g. pd.util.testing) is in place, those three separate fixtures can be boiled down to just box_with_array.
Move min/max/argmin/argmax from DatetimeIndexOpsMixin to DatetimeLikeArrayMixin (and wrap where appropriate). mean, std, and presumably others can be implemented in terms of the i8 values.
DataFrame.transpose:
>>> dti = pd.date_range('2016-01-01', periods=3, tz='US/Pacific')
>>> pd.DataFrame(dti).T
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pandas/core/frame.py", line 2571, in transpose
return super(DataFrame, self).transpose(1, 0, **kwargs)
File "pandas/core/generic.py", line 686, in transpose
new_values = self.values.transpose(axes_numbers)
File "pandas/core/base.py", line 672, in transpose
nv.validate_transpose(args, kwargs)
File "pandas/compat/numpy/function.py", line 56, in __call__
self.defaults)
File "pandas/util/_validators.py", line 218, in validate_args_and_kwargs
validate_kwargs(fname, kwargs, compat_args)
File "pandas/util/_validators.py", line 157, in validate_kwargs
_check_for_default_values(fname, kwds, compat_args)
File "pandas/util/_validators.py", line 69, in _check_for_default_values
format(fname=fname, arg=key)))
ValueError: the 'axes' parameter is not supported in the pandas implementation of transpose()
I'm not sure off the top of my head whether this bug affects the composition switchover, just that I've seen it a bunch recently.
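To illustrate the reductions item above: min and mean can be computed on the int64 ("i8") view of a datetime64[ns] ndarray, as long as NaT is masked out first. This is a minimal sketch with plain NumPy, not the pandas implementation; the helper names _valid_i8, i8_min and i8_mean are hypothetical.

```python
import numpy as np

# int64 sentinel that NumPy uses to represent NaT
I8_NAT = np.iinfo(np.int64).min


def _valid_i8(values):
    """Return the int64 view of a datetime64[ns] array with NaT dropped."""
    i8 = np.asarray(values, dtype="M8[ns]").view("i8")
    return i8[i8 != I8_NAT]


def i8_min(values):
    valid = _valid_i8(values)
    if valid.size == 0:
        return np.datetime64("NaT", "ns")
    return np.datetime64(int(valid.min()), "ns")


def i8_mean(values):
    valid = _valid_i8(values)
    if valid.size == 0:
        return np.datetime64("NaT", "ns")
    # Average offsets from the minimum to limit float rounding error.
    base = int(valid.min())
    offset = int(round(float((valid - base).mean())))
    return np.datetime64(base + offset, "ns")


values = np.array(["2016-01-01", "NaT", "2016-01-03"], dtype="M8[ns]")
print(i8_min(values))
print(i8_mean(values))
```

An Index-level method would then only need to call the array method and box the result, which is roughly what "wrap where appropriate" suggests.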
@TomAugspurger is this helpful or too much of a grab-bag?
the remaining pieces of the EA interface are just _from_sequence and optionally _values_for_argsort, argsort, and _reduce? _from_sequence I'm working on (#23675, #23702), and those other methods are orthogonal so could be worked on in parallel.
My main concern there is that until we inherit from ExtensionArray, those will be untested (or we'll have to duplicate tests). Unless... I suppose we could start inheriting the base tests, without actually inheriting from them yet? That may be worth exploring. Some things like pd.isna will fail hard until we inherit, but others should work. We could also put temporary checks in things like is_extension_array_dtype to also recognize DatetimeArray and TimedeltaArray.
I suspect the transpose bug will be fixed by inheriting from ExtensionArray.
@TomAugspurger If you have currently the time to further work on that branch to try to switch to composition, I would say: let's push for it now and first (and thus temporarily hold on for other changes). I should have time on Friday to thoroughly review it (and also a bit in the weekend), today/tomorrow I am at PyParis.
I already said that before, but the sooner we can actually switch to composition, the clearer the follow-up PRs will be (e.g. now the _concat_same_type was already wrong in the PR, possibly because it was simply not exercised in all our concat-related tests: as long as it is not an actual extension array, it is not used).
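For readers less familiar with the interface: _concat_same_type is the classmethod an extension array provides to concatenate a list of arrays of its own type. A rough sketch of what such a method could look like for an i8-backed array follows. I8ArraySketch and its tz check are assumptions made for illustration, not the code from the PR.

```python
import numpy as np


class I8ArraySketch:
    """Hypothetical i8-backed array, standing in for DatetimeArray."""

    def __init__(self, i8values, tz=None):
        self._data = np.asarray(i8values, dtype="i8")
        self.tz = tz

    @classmethod
    def _concat_same_type(cls, to_concat):
        # All inputs must share a timezone for the result to be well-defined.
        tzs = {arr.tz for arr in to_concat}
        if len(tzs) != 1:
            raise ValueError("to_concat must all have the same tz")
        values = np.concatenate([arr._data for arr in to_concat])
        return cls(values, tz=tzs.pop())


left = I8ArraySketch([1, 2], tz="UTC")
right = I8ArraySketch([3], tz="UTC")
result = I8ArraySketch._concat_same_type([left, right])
print(result._data, result.tz)
```

A method like this only runs when the concat machinery treats the object as an extension array, which is why a bug in it can go unnoticed before the switch.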
Also all the datetime-arithmetic related issues, they can in principle be done after the split. As long as arithmetic works for Index/Series (for which it is already tested), it's fine (of course, before releasing we should also fix + test all arithmetic on the arrays as well, but just to say it is not necessary to do that first).
let's push for it now and first (and thus temporarily hold on for other changes).
@jorisvandenbossche I would really appreciate it if you didn't advocate shutting down all progress on things that I'm putting a lot of time and effort into.
That's not how I read Joris' comment. I read "temporary hold" as... just that. A pause, not a shutting down or throwing away. I think all the effort in open PRs (and possibly some unpushed work you've done) is still vital.
There are many paths from master (plus the open PRs) to DatetimeArray. To me, a path that frontloads the switch to composition makes sense, but it's hard to say ahead of time. I've had trouble thinking through all the ramifications of a diff, partly because the current class hierarchy "feels weird" to me, and partly because I'm not familiar with this section of the code base.
Anyway, I think that https://github.com/pandas-dev/pandas/pull/23675 and https://github.com/pandas-dev/pandas/pull/23642/ are the next to go in. @jbrockmendel do you have other WIP branches changing the indexes or arrays datetimelike files that you'd like to push?
On my own availability: I'm going to be ramping up on a largish project for dask in the next few days / weeks. I'll still have time for pandas, but not as much over the next month or so. So I have a slight window to dump all my time into pandas, that I'd like to take advantage of if possible.
@jbrockmendel yes, sorry if that is the way it came over. I certainly think we can do parallel work, and not all the items you listed in https://github.com/pandas-dev/pandas/issues/23185#issuecomment-435592816 would go in such a split PR, it was just about what to merge first as it will probably be easier to rebase the smaller PRs than the other way around. And also, if we do that push, there will also be a lot of reviewing work :)
And also, it will depend on the PR of course. If it is something that doesn't overlap a lot (like certain test changes), for sure it doesn't need to wait with being merged.
Have we had a design discussion on DatetimeDtype? A short proposal:
Rename to DatetimeDtype. We can keep both the unit and tz arguments (in case we ever support resolutions other than ns in the future), but we continue to raise on non-ns unit.
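from datetime import tzinfo
from typing import Optional, Union
from pandas.api.extensions import ExtensionDtype
class DatetimeDtype(ExtensionDtype):
    def __new__(cls, unit='ns', tz: Optional[Union[str, tzinfo]] = None):
        ...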
We remove the "magic" creation from string.
In [3]: pd.core.dtypes.dtypes.DatetimeTZDtype('datetime64[ns, utc]')
Out[3]: datetime64[ns, utc]
That would throw an error, since 'datetime64[ns, utc]' isn't a valid unit.
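A small sketch of the behaviour proposed above, assuming a plain class rather than a real ExtensionDtype subclass. DatetimeDtypeSketch, its error message, and the name property are hypothetical; only the unit/tz arguments and the "raise on non-ns unit" rule come from the proposal.

```python
from datetime import tzinfo
from typing import Optional, Union


class DatetimeDtypeSketch:
    """Hypothetical stand-in for the proposed dtype (not the pandas class)."""

    def __init__(self, unit: str = "ns",
                 tz: Optional[Union[str, tzinfo]] = None):
        # No string parsing: 'datetime64[ns, utc]' is simply an invalid unit.
        if unit != "ns":
            raise ValueError(f"only the 'ns' unit is supported, got {unit!r}")
        self.unit = unit
        self.tz = tz

    @property
    def name(self) -> str:
        if self.tz is None:
            return f"datetime64[{self.unit}]"
        return f"datetime64[{self.unit}, {self.tz}]"


print(DatetimeDtypeSketch(tz="UTC").name)

try:
    DatetimeDtypeSketch("datetime64[ns, utc]")
except ValueError as err:
    print(err)
```

With no "magic" string handling in the constructor, parsing a string like 'datetime64[ns, utc]' would have to live in a separate, explicit code path.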
One question though... I'm worried about changing the .dtype exposed to users via Series.dtype and DatetimeIndex.dtype:
In [3]: pd.Series(pd.date_range('2000', periods=4)).dtype
Out[3]: dtype('<M8[ns]')
I'm not sure what the ramifications of changing that to always be DatetimeDtype would be. Of course, when a timezone is involved we'll need the dtype to be a DatetimeDtype, but should we use the numpy type when it suffices?
One question though... I'm worried about changing the .dtype exposed to users via Series.dtype and DatetimeIndex.dtype
Yes. What I thought before about this is that we would need to have an if/else logic there, to still return the numpy dtype if there is no tz. I would put this logic on the Series/Index, and have the array always return the extension dtype.
However, there might be places where we check the dtype of values which can be coming from Series/Index to be an ExtensionDtype? (eg to take another code path for extension arrays) So the above might be problematic for such code paths?
I would put this logic on the Series/Index, and have the array always return the extension dtype.
Agreed that's the right place to do it. I'll see if any of your concerns come up (I suspect something will).
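The if/else idea discussed above could look roughly like this. All three classes are hypothetical stand-ins (they are not pandas classes); the only point is where the branch lives: the array always reports the extension dtype, and the Series/Index-like container falls back to the NumPy dtype when there is no tz.

```python
import numpy as np


class TZDtypeSketch:
    """Hypothetical extension dtype carrying an optional timezone."""

    def __init__(self, tz=None):
        self.tz = tz


class ArraySketch:
    """Hypothetical array: always reports the extension dtype."""

    def __init__(self, tz=None):
        self._dtype = TZDtypeSketch(tz)

    @property
    def dtype(self):
        return self._dtype


class ContainerSketch:
    """Hypothetical Series/Index: hides the extension dtype when tz-naive."""

    def __init__(self, values):
        self._values = values

    @property
    def dtype(self):
        array_dtype = self._values.dtype
        if array_dtype.tz is None:
            # Keep returning the NumPy dtype users see today.
            return np.dtype("M8[ns]")
        return array_dtype


naive = ContainerSketch(ArraySketch())
aware = ContainerSketch(ArraySketch(tz="US/Pacific"))
print(naive.dtype)
print(type(aware.dtype).__name__)
```

The sketch also shows the concern raised earlier: a check made on the container's dtype takes the NumPy path for tz-naive data, while the same check on the underlying array's dtype would not.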
I'll see if any of your concerns come up (I suspect something will).
I would expect many of the places where we use is_extension_array_dtype and pass it the actual dtype and not the array or container ..
Yeah. We could change those to
is_extension_array_dtype(self._values.dtype)
where necessary.
Small status update here: I played with moving to composition a bit last week. The basic idea was
The biggest challenge was our is_datetime* functions. They were breaking in a lot of places and in strange ways when passed a DatetimeDtype rather than an np.dtype.
Today, I've experimented with a new branch that changes the data model of DatetimeArray slightly. DatetimeArray.dtype is now a Union of np.dtype or DatetimeDtype. We'll use DatetimeDtype (or keep the name as DatetimeTZDtype) when there's a timezone, and we'll use np.dtype('M8[ns]') otherwise. This should result in a much smaller diff. I suspect that we can later clean up the dtypes so that DatetimeArray.dtype is always a DatetimeDtype, but I think that need not block the release.
I'll push something up by the end of the day.
@TomAugspurger thanks for the update, and for handling the tricky dtype stuff. Aside from review, is there anything the rest of us can do to be helpful?
On my end, I'm preparing to push a branch that fixes the last of the arithmetic tests (mainly with DateOffset) for DTA/TDA (without this, these arithmetic ops would fail on the Index classes after the switch to composition). #23675 needs some edits+rebase, is otherwise close to the finish line, will put DatetimeArray._from_sequence within reach.
Added a bunch of Issues to the "DatetimeArray Refactor" Project, most of them non-blockers, e.g. reduction methods we can get around to eventually.
Still just grinding away at the inheritance -> composition move. Mostly just moving around methods / adding wrappers in small places.
I haven't really touched internals yet. I'm not sure when the best time to do that would be. For DatetimeArray, we can actually push that discussion off till after we switch things, since we already have two blocks. I'll post again when I have a better-formed opinion here.
I've isolated (one of?) the segfaults to DatetimeArray.__new__ calling conversion.ensure_datetime64ns.
This segfaults on my branch
diff --git a/pandas/tests/groupby/test_apply.py b/pandas/tests/groupby/test_apply.py
index 3bc5e51ca..e64bdc9ea 100644
--- a/pandas/tests/groupby/test_apply.py
+++ b/pandas/tests/groupby/test_apply.py
@@ -6,6 +6,13 @@ from pandas.util import testing as tm
 from pandas import DataFrame, MultiIndex, compat, Series, bdate_range, Index

+def test_apply_tz():
+    df = pd.DataFrame({'a': [1, 3, 3, 4]},
+                      index=pd.DatetimeIndex(['2000', '2000', '2001', '2001']))
+    gr = df.groupby(df.index.date)
+    gr.apply(lambda x: x.idxmax())
+
+
But passes when we don't call ensure_datetime64ns
diff --git a/pandas/core/arrays/datetimes.py b/pandas/core/arrays/datetimes.py
index 65f6d6859..612e48792 100644
--- a/pandas/core/arrays/datetimes.py
+++ b/pandas/core/arrays/datetimes.py
@@ -258,7 +258,7 @@ class DatetimeArrayMixin(dtl.DatetimeLikeArrayMixin):
         assert isinstance(values, np.ndarray), type(values)
         assert is_datetime64_dtype(values)  # not yet assured nanosecond
-        values = conversion.ensure_datetime64ns(values, copy=False)
+        # values = conversion.ensure_datetime64ns(values, copy=False)
         result = cls._simple_new(values, freq=freq, tz=tz)
         if freq_infer:
I haven't figured out the actual cause yet.
Alrighty, we're close now https://github.com/TomAugspurger/pandas/tree/disown-tz-only
Right now this diff is at
89 files changed, 1859 insertions(+), 906 deletions(-)
and I have 100 xfails / skips. I'm going to spend the rest of today splitting off independent pieces, cleaning things up, and organizing the history a bit, before making a PR tonight or tomorrow.
Excellent, looking forward to taking a look. Are the skips segfault-free?
With a little luck many of the xfails will be fixed by implementing the remaining methods on DTA/TDA, most of which are (hopefully) near merging.
They are indeed segfault free. There's still a subtle failure involving a groupby resample coming from us doing bad stuff in Cython. We somehow manage to create a DatetimeIndex where DatetimeIndex._values is an ndarray, rather than a DatetimeArray. This causes an exception, but not a segfault.
A thought on a way forward, seeing as how Tom has earned some down-time.
With _index_data implemented, I'm finding the approach used in the previously-segfaulting branch (tentatively) working. Define on DTI/TDI/PI respectively:
# On DatetimeIndex
@property
def _eadata(self):
    return DatetimeArray._simple_new(self._data, freq=self.freq, tz=self.tz)

# On TimedeltaIndex
@property
def _eadata(self):
    return TimedeltaArray._simple_new(self._data, freq=self.freq)

# On PeriodIndex
@property
def _eadata(self):
    return self._data
Then do the entire inheritance/composition switchover, but dispatching to self._eadata instead of self._data. (Several steps later we'll remove _eadata and dispatch to _data).
This limits the diff to the index classes without changing their outward-facing behavior, making for a much more manageable scope.
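To make the dispatch pattern concrete, here is a toy sketch with stand-in classes. ArraySketch and IndexSketch are hypothetical and use plain lists rather than real datetime data; only the _eadata / _data / _simple_new names come from the proposal above.

```python
class ArraySketch:
    """Hypothetical stand-in for DatetimeArray / TimedeltaArray."""

    def __init__(self, data, freq=None):
        self._data = data
        self.freq = freq

    @classmethod
    def _simple_new(cls, data, freq=None):
        # Cheap constructor: wrap the data without validation.
        return cls(data, freq=freq)

    def max(self):
        return max(self._data)


class IndexSketch:
    """Hypothetical stand-in for DatetimeIndex / TimedeltaIndex."""

    def __init__(self, data, freq=None):
        self._data = list(data)  # unchanged: still the raw values
        self.freq = freq

    @property
    def _eadata(self):
        # Temporary view of the raw values as an array object.
        return ArraySketch._simple_new(self._data, freq=self.freq)

    def max(self):
        # Dispatch to the array instead of using an inherited method.
        return self._eadata.max()


idx = IndexSketch([3, 1, 2], freq="D")
print(idx.max())
```

Because _data keeps its current type, code outside the index classes should not notice the change, which is the scope-limiting property described above.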
Thoughts?
My vote is for getting #24024 in sooner rather than later, but I'm the most familiar with the diff so it's easier for me to go through the entire thing at once. It's blocking several changes I'd like to wrap up, and my time for pandas is limited.
My vote is for getting #24024 in sooner rather than later
AFAICT the sticking points are:
DatetimeArray.__init__ and TimedeltaArray.__init__ is a sticking point for me
I have no strong opinion on which approach to take for the astype question. For the rest, I think the best route to "sooner rather than later" is to merge #24394
A master issue, to help keep track of things.
High-level, I think we have two things to sort out
1. Design
We have a few things to sort out
a. Composition vs. inheritance of Index / Series and array classes
b. ...
2. Implementation Plan
A few options
a. Big PR implementing one or more arrays, followed by smaller cleanup PRs
b. Incremental PRs (implementing most of the interface on the *ArrayMixin classes), followed by a final PR making the switch
c. ...
Project board with the relevant discussions / PRs: https://github.com/pandas-dev/pandas/projects/4