
Datetimelike Array Refactor #23185

Closed TomAugspurger closed 5 years ago

TomAugspurger commented 6 years ago

A master issue, to help keep track of things.

High-level, I think we have two things to sort out

  1. Design our datetimelike Arrays (DatetimeArray, TimedeltaArray, PeriodArray)
  2. A plan for getting from master to an implementation of that design.

1. Design

We have a few things to sort out

a. Composition vs. inheritance of Index / Series and array classes
b. ...

2. Implementation Plan

A few options

a. Big PR implementing one or more arrays, followed by smaller cleanup PRs
b. Incremental PRs (implementing most of the interface on the *ArrayMixin classes), followed by a final PR making the switch
c. ...

Project board with the relevant discussions / PRs: https://github.com/pandas-dev/pandas/projects/4

TomAugspurger commented 6 years ago

cc @jreback @jorisvandenbossche @jbrockmendel

OK, I wanted to open this issue because I've had trouble keeping up with all the goings-on around this (and I was getting tired of rebasing the WIP PeriodArray PR). So I'm hoping we can come to a rough consensus on these outstanding discussions, and a concrete plan for getting from master to there.

If necessary, we can move to a google doc or something to collaboratively edit the design doc. I don't have a preference. Feel free to add discussion points to the "outstanding discussions" list.

I'll try to write up my current thoughts later today.

jbrockmendel commented 6 years ago

You're right (and Joris has expressed this elsewhere) that this conversation has splintered across a lot of places and centralization will help.

I'm a big fan of breaking up big problems into smaller more manageable problems (as evidenced by #23159, small-step tslibs refactor, doomed improvement efforts in statsmodels, ...). Are there any logically independent parts of the problem that can be split off?

FWIW:

#23173 I opened specifically to split discussion-requiring pieces off from the other PRs

#22535 does touch the mentioned files, but should be orthogonal to the topics under discussion.

#23113 I hope we can push through relatively easily, since it is needed for DatetimeArray + DateOffset addition to work.

jorisvandenbossche commented 6 years ago

I mentioned it on one of the PRs, but is it OK for you guys to agree on not merging any of the open PRs, before we have some agreement on the way forward?

@jbrockmendel I understand you want to keep working on those PRs you opened (and it's also great that you do so much for pandas!), but let's maybe take a short pause on new PRs until we agree on how we want to get to the finish line here (in that sense: can you answer my mail regarding Thursday?). Of course, doing some PRs might help to explore certain options, show how things can be done, etc. So that is certainly fine, just not finalizing / merging.

jbrockmendel commented 6 years ago

Sounds good.

TomAugspurger commented 6 years ago

Composition vs. inheritance.

The case for composition:

  1. I see EAs as being at the same level as ndarrays. Series & Index box an array (ndarray or EA). I think it'd be strange for some indexes to be instances of an EA, depending on the dtype.
  2. I think the calls have only been for specific Index classes to inherit from EA. We'll limit the rest of this discussion to just Index, but I think that fact itself is an argument for composition.
  3. Indexes have names, arrays don't. That means we'll have similar, but different, function signatures and calls to pass names where they need to go.
  4. Some methods common to both Index and EA have different semantics (e.g. shift).

The case against:

  1. Additional boilerplate code for dispatching an operation from the Series or Index to the EA.
  2. ... (make your case here)

Since I'm pro-composition, I'll point out that I think we're required to have that dispatching code anyway. If we're a Series / Index backed by an ndarray, then we dispatch to numpy. If we're backed by an EA, then we dispatch to it. Or, to put it another way, Index / Series ops simply dispatch to the ops of the underlying array.
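
To make the dispatching point concrete, here is a minimal sketch of what composition-based dispatch could look like (the class names are illustrative, not the actual pandas implementation):

```python
# Minimal sketch of composition-based dispatch; class names are illustrative,
# not the actual pandas classes.
import numpy as np


class ArraySketch:
    """Stand-in for an ExtensionArray that owns the actual values."""

    def __init__(self, values):
        self._ndarray = np.asarray(values, dtype="float64")

    def isna(self):
        return np.isnan(self._ndarray)

    def __add__(self, other):
        return type(self)(self._ndarray + other)


class IndexSketch:
    """Stand-in for an Index that boxes an array and dispatches to it."""

    def __init__(self, data, name=None):
        self._data = data if isinstance(data, ArraySketch) else ArraySketch(data)
        self.name = name  # Index-only metadata stays on the Index, not the array

    def isna(self):
        return self._data.isna()

    def __add__(self, other):
        # dispatch the op to the underlying array, then re-box with the name
        return type(self)(self._data + other, name=self.name)
```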

jbrockmendel commented 6 years ago

If there is a nice solution to the immutability/caching issue, then I can get on board with just-composition. Until then, I think both is the way to go for now. i.e. PeriodIndex subclasses PeriodArray and PeriodIndex.values returns a PeriodArray.

Your point 1) is the one I find most compelling. It would be really nice if .values had a consistent definition as always-lossless.

TomAugspurger commented 6 years ago

I'm not sure how doing both would work in practice. I don't have a good sense for what complications that's likely to create.

But, I think that won't be necessary. I think we can manage to cache attributes like hasnans or _isnan on the Array, and invalidate that cache when necessary (__setitem__).

It would be really nice if .values had a consistent definition as always-lossless.

Agreed. At this point, I think that's our only hope of having a consistent definition for what .values is.

TomAugspurger commented 6 years ago

On the caching point, this seems to work

```diff
diff --git a/pandas/core/arrays/period.py b/pandas/core/arrays/period.py
index 24d4b6e55..23a5845f0 100644
--- a/pandas/core/arrays/period.py
+++ b/pandas/core/arrays/period.py
@@ -465,6 +465,14 @@ class PeriodArray(dtl.DatetimeLikeArrayMixin, ExtensionArray):
                    "Got '{}' instead.".format(type(value).__name__))
             raise TypeError(msg)
         self._data[key] = value
+        self._invalidate_cache()
+
+    def _invalidate_cache(self):
+        self._cache = {}
+
+    @cache_readonly
+    def hasnans(self):
+        return self.isna().any()
 
     def take(self, indices, allow_fill=False, fill_value=None):
         from pandas.core.algorithms import take
```

There may be edge cases, or cases where we could skip invalidating the cache, but it's at least feasible.

It didn't come across in the diff, but the call to _invalidate_cache comes from PeriodArray.__setitem__.

"tests":

without the invalidation

```python
In [1]: import pandas as pd

In [2]: arr = pd.core.arrays.period_array(['2000', None], 'D')

In [3]: arr.hasnans
Out[3]: True

In [4]: arr[1] = pd.Period(2000, 'D')

In [5]: arr.hasnans
Out[5]: True
```

with invalidation

```python
In [1]: import pandas as pd

In [2]: arr = pd.core.arrays.period_array(['2000', None], 'D')

In [3]: arr.hasnans
Out[3]: True

In [4]: arr[1] = pd.Period(2000, 'D')

In [5]: arr.hasnans
Out[5]: False
```
jorisvandenbossche commented 6 years ago

It overlaps with the first point of Tom, but an additional case for composition / disadvantage of inheritance:

I personally also don't see any compelling case for inheritance (the dispatching is not a reason either IMO, as we need to do it anyway for Series).

So personally, unless someone now actually makes an extensive and detailed argument for inheritance, I would propose to leave this discussion behind us and focus on how to solve possible remaining issues with the composition structure (the constructors, the caching).

jorisvandenbossche commented 6 years ago

For the inheritance vs composition, we actually have examples to look at in practice: interval and categorical already do composition, while the datetimelikes are currently a kind of inheritance. I think the current datetimelike implementation shows the complexity of this.

jorisvandenbossche commented 6 years ago

Related to the caching issue, I think there are several options:

  1. Assess if it is actually significant (e.g. to decide whether it is a blocker for the PeriodArray PR). From a quick timing, calculating the _isnan property the first time takes about 5% of the time of a DatetimeIndex() + Timedelta() arithmetic operation (whether it has a similar impact when we actually no longer cache is not guaranteed, of course; that should be checked further).

  2. Implement caching on the Array objects. I think this is certainly feasible, as Tom also explores above.

  3. Factor the operations that make use of caching out into functions instead of methods (like the current _add_delta methods) that are shared between Index and Array, and where e.g. the (cached or not) mask can be passed as an argument (see the sketch below).

So I think there is certainly a solution possible, and my first point about "is this important" is mostly about whether this is essential to have in an initial "big split" PR like the PeriodArray PR, or whether it can be left for a follow-up PR.
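
As a rough illustration of option 3 (the function name is hypothetical, not actual pandas code), the shared logic becomes a plain function that accepts the mask, so an Index can pass its cached _isnan while an Array can compute it on the fly:

```python
# Rough illustration of option 3; the function name is hypothetical.
import numpy as np

iNaT = np.iinfo(np.int64).min  # the i8 sentinel pandas uses for NaT


def add_delta_i8(i8values, delta_nanos, mask):
    """Add `delta_nanos` to i8-backed datetimelike values, keeping NaT as NaT."""
    result = i8values + delta_nanos
    result[mask] = iNaT
    return result


values = np.array([10, iNaT, 30], dtype="i8")
mask = values == iNaT              # an Index would pass its cached _isnan here
add_delta_i8(values, 5, mask)      # -> [15, iNaT, 35]; the NaT slot is preserved
```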

jorisvandenbossche commented 6 years ago

Warning, long post coming.

My proposal for a possible way forward, let's call it the "minimally big" PR with follow-up PRs proposal:

My reasons for going with the above way forward, rather than "first smaller clean-up PRs, then split", are the following:

This proposal would mean that master will be in a temporary "messy" state (but still green of course). If we find this a problem, we can always first merge those PRs to a refactor branch, and only after some of the follow-up PRs have been done to that branch as well, merge it into master. (personally, I don't think that is needed as long as we keep master green, as it will only give additional complexity to keep master and that branch in sync).

Doing the above would in practice mean: first focus on some of the design discussions (e.g. the design of the constructors, and other issues mentioned in the top post), then focus on doing and reviewing the actual splits, and for now hold off on the other smaller PRs. Also with this proposal I think there is enough work to share for multiple people working at the same time (without trying to get things out of the big PRs into independent PRs). Doing those discussions, reviewing PRs, doing the Datetime or Timedelta Array, etc. are all things that need to be done and take time.


So please give your thoughts about this proposal, feedback, alternative proposals, ...

jbrockmendel commented 6 years ago

If patching __setitem__ works robustly I think that would be a great solution. It would probably be useful elsewhere, too. The fact that it would be so useful makes me think it must be harder than it looks, otherwise it would have been done long ago. But I'd be happy to be proven wrong.

A clear data model that you can explain and reason about

If we were to go whole-hog on inheritance, the data model would be "An Index is an Array with some additional fancy indexing/join/set methods (and a Block is an Array with mgr_locs attribute)". That said, conditional on the __setitem__ thing working out, I'm happy to end this part of the conversation.

Factor the operations that make use of caching out into functions instead of methods

That's an interesting idea. I'd like to give it some thought before forming an opinion.

The current smaller PRs are very difficult to judge and review because the big picture (the final code structure) is still missing

Are they though? DatetimeArray does need to_period and to_perioddelta for arithmetic operations to work. Is there any scenario in which we don't want all three arrays to have take implemented? If anything, isolating them makes it clear what depends on what, and avoids the phenomenon in big PRs where I have to ask "is this related to everything else?"


My main objection to the Big PR is that it precludes working in parallel. If there was something about the implementation that required it be done All At Once that would be another matter, but as it is there is a lot of non-difficult stuff we can get out of the way before making final decisions about caching and constructors. i.e. the "Minimal Big Split" can be made more minimal.

Suppose hypothetically that two things both turn out to be more difficult than expected: the __setitem__ patching and the arithmetic methods. In the Big/Minimally Big model, everything is on hold until a single PR gets all of it right. In a parallel model, I can figure out the arithmetic while Tom figures out the caching. (For the sake of the example I'm assuming that the ways in which they are unexpectedly difficult are logically independent)


The status quo is that non-#22862 PRs are on hold. While not my first choice, I'd rather see that move forward than go in circles here. Let's see what jreback has to say and reconnoiter.

jreback commented 6 years ago

Some general comments / points.

TomAugspurger commented 6 years ago

It would be nice to share between EA implementations for DateArray / PeriodArray (which I think is accomplished by the DatetimelikeMixin), the concern is that this is also a mixin with Index?

I'm not sure, but I share this concern. I expect that we'll have a mixin or base class for DatetimeLikeArray with these common ops, and a base class for DatetimelikeIndex that just does the dispatching. I'm hopeful we won't need a mixin for DatetimelikeIndex.

EAs need to become more of a drop-in replacement for ndarrays.

We must not drop caching in either the EA's or the Indexes.

These two are slightly in tension. AFAIK, right now the only way to update an ExtensionArray inplace is with __setitem__. If we add any more, we would need to remember to manually invalidate the cache.

edit: inplace ops are another, though NaNs usually (always, for inplace?) propagate.

TomAugspurger commented 6 years ago

The fact that it would be so useful makes me think it must be harder than it looks, otherwise it would have been done long ago.

I suspect this is because it's not on ndarrays, and we didn't have an intermediate array-like that could track these. ndarrays can be manipulated inplace in so many ways that a cache sounds infeasible.
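
For example (plain NumPy, just to illustrate why a cache on a raw ndarray can't be trusted):

```python
# In-place mutation can happen through views without any method being called
# on the original array, so a cached value would silently go stale.
import numpy as np

a = np.array([1.0, 2.0, np.nan])
cached_hasnans = np.isnan(a).any()   # True -- suppose we cached this

view = a[2:]
view[0] = 0.0                        # mutates `a` through the view

np.isnan(a).any()                    # now False; the cached value is stale
```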

TomAugspurger commented 6 years ago

I feel like I still lack the information to make a judgement call on how to proceed here. So, how about I spend a chunk of time getting #22862 in to a reasonable state. I'll try to make it as minimal as possible while still passing, and in the process I'll identify pieces that can be reasonably split off.

I think the biggest outstanding discussion / PR is around constructors and whether https://github.com/pandas-dev/pandas/pull/23140 should go first. I'll try to form some thoughts around that quickly.

jorisvandenbossche commented 6 years ago

Thanks for the answers!

(and sorry again for my long answers :))

My main objection to the Big PR is that it precludes working in parallel.

Because this is an important point, I had a paragraph above trying to explain why I think this does not need to be the case, as I can also argue for the opposite:

In a parallel model, I can figure out the arithmetic

What do you mean exactly here? I think @TomAugspurger already figured this out in his PeriodArray PR (at least a minimal working solution that gets the job done). Tom, correct me if I am wrong. And also, I think it is still possible to explore the caching in parallel (try it on a different array, or do it as a patch on top of the period array PR's branch)

I would like to decouple the discussions of Blocks from this conversation for now.

Fully agree here. Apart from the fact that we'll have an ExtensionBlock instead of the custom ones, there should not be many changes related to blocks.

It would be nice to share between EA implementations for DateArray / PeriodArray (which I think is accomplished by the DatetimelikeMixin), the concern is that this is also a mixin with Index?

I think we will typically end up with a base class / mixin for the Arrays to share functionality there, and a mixin / base class for the datetimelike indexes to share things. I think those two mixins can be completely separate (the current Datetimelike mixin is indeed shared between Arrays and Index, but that is just the temporary, confusing state where Index/Array is not yet split properly).

I really would like to avoid completely changing idioms at this stage, mainly adding constructors that are non-idiomatic to pandas (e.g. this happened with RangeIndex and this completely breaks all other patterns). Like it or not we are stuck with this pattern

We are stuck with it for Index, but IMO that should not mean we should follow the exact same pattern for the Arrays (in all the different PRs related to this we have several times tried to discuss / understand what the different __new__/_simple_new/_shallow_copy/_shallow_copy_with_infer do; I really think we can make it simpler at the Array level). But I will open a separate issue about that to keep that discussion separate, as it is quite orthogonal to the rest of the workflow discussion.

EAs need to become more of a drop-in replacement for ndarrays. I am not advocating adding all methods as numpy has too many, but certain basic things should just work / exist (talking about .repeat() here).

We can certainly discuss, but in light of keeping the initial big-split PRs as minimal as possible / not having more discussion than needed on them, I would personally keep this discussion for a follow-up (of course, as long as the method is not essential to have the EA interface working).

jorisvandenbossche commented 6 years ago

I feel like I still lack the information to make a judgement call on how to proceed here. So, how about I spend a chunk of time getting #22862 in to a reasonable state. I'll try to make it as minimal as possible while still passing, and in the process I'll identify pieces that can be reasonably split off.

+1

I think the biggest outstanding discussion / PR is around constructors and whether #23140 should go first. I'll try to form some thoughts around that quickly.

If we focus first on the PeriodArray PR, I don't think that PR needs to be merged. But the discussion we had in one of the review comments (https://github.com/pandas-dev/pandas/pull/23140#discussion_r225003108) about the constructors ideally should indeed be resolved. I will try to look at that now as well, and open a separate issue about it (unless you beat me to it).

TomAugspurger commented 6 years ago

What do you mean exactly here? I think @TomAugspurger already figured this out in his PeriodArray PR (at least a minimal working solution that gets the job done). Tom, correct me if I am wrong.

Ops work on the PeriodArray PR by dispatch. Though maybe @jbrockmendel meant ops with caching.

```python
# PeriodIndex.__add__
def __add__(self, other):
    # dispatch to ExtensionArray implementation
    result = self._data.__add__(other)
    return wrap_arithmetic_op(self, other, result)
```
TomAugspurger commented 6 years ago

small FYI: I've updated the original post with a list of PeriodArray blockers that aren't touching any of the datetimelike files (e.g. https://github.com/pandas-dev/pandas/pull/23155). Those are bugs in master that hopefully have a clear fix which won't lead away from our end goal.

jorisvandenbossche commented 6 years ago

Though maybe @jbrockmendel meant ops with caching.

Since he mentioned both in the same sentence as two different things, I assumed this was not the case. Hence the question for clarification :-)

jbrockmendel commented 6 years ago

It would be nice to share between EA implementations for DateArray / PeriodArray (which I think is accomplished by the DatetimelikeMixin), the concern is that this is also a mixin with Index?

The plan as I understand it is that DatetimeLikeArrayMixin will be mixed into the EA subclasses but will cease to be mixed in to DatetimeLikeIndexMixin.
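
In class terms, roughly this layering (names as used in this discussion; the bodies are placeholders, not the real implementations):

```python
# Sketch of the intended layering; docstrings stand in for the real bodies.

class DatetimeLikeArrayMixin:
    """Shared array-level behavior: i8 view, NaT handling, shared arithmetic."""


class DatetimeArray(DatetimeLikeArrayMixin):
    """ExtensionArray of datetime64[ns] (possibly tz-aware) values."""


class TimedeltaArray(DatetimeLikeArrayMixin):
    """ExtensionArray of timedelta64[ns] values."""


class DatetimeLikeIndexMixin:
    """Index-level behavior only (names, joins, indexing engine); array-level
    ops are dispatched to the boxed array instead of being inherited."""


class DatetimeIndex(DatetimeLikeIndexMixin):
    """Boxes a DatetimeArray; no longer mixes in DatetimeLikeArrayMixin."""
```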

Though maybe @jbrockmendel meant ops with caching.

I meant extending the tests in tests/arithmetic to include the EA subclasses. Since this was just a hypothetical example of "two things going wrong at the same time", let's not spend too much time on it.

I feel like I still lack the information to make a judgement call on how to proceed here. So, how about I spend a chunk of time getting #22862 in to a reasonable state. I'll try to make it as minimal as possible while still passing, and in the process I'll identify pieces that can be reasonably split off.

Sounds good.

In the interim, I'd like to get exceptions to the datetimelike PR moratorium for the following, which I think should have minimal overlap:

I'll hold off pending an explicit OK.

TomAugspurger commented 6 years ago

I pushed an update to https://github.com/pandas-dev/pandas/pull/22862/files that reduced the scope somewhat. Outside of indexes/period.py and arrays/period.py, there shouldn't be any extraneous changes.

I can work to reduce the changes to indexes/period.py and arrays/period.py a bit, but I'd like to get the constructors nailed down first.

TomAugspurger commented 6 years ago

Arithmetic methods in all four core.arrays files

Do you have a plan / WIP for this? My WIP PeriodArray PR doesn't do too much there I think.

jbrockmendel commented 6 years ago

I do have a branch about ready for the arithmetic fixes, will open a PR later today.

TomAugspurger commented 6 years ago

I replaced the list of issues with a simple board at https://github.com/pandas-dev/pandas/projects/4

I think everyone has permission to add / move things around there.

TomAugspurger commented 6 years ago

Last biggish design discussion: what is .values for the index and series? Do we want to have that discussion here, or split it off?

I think @jbrockmendel's preference is for .values to be lossless. I think this is a good principle to shoot for. We can't say that it's always an ndarray (e.g. Categorical), so lossless / no-copy is a good alternative.

We have some intermingling factors here:

  1. datetime64[ns, tz] can be converted to UTC, but this isn't 100% lossless. We don't know the original timezone, and there's probably weirdness with folds around DST transitions (see the example after this list).
  2. datetime64[ns] (no tz) and timedelta64[ns] can be represented losslessly as an ndarray. If we return an EA for datetime64[ns, tz], do we want to return an EA for these as well? Returning an EA would be "internally consistent", but would be a larger API change.
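
To illustrate point 1 with today's behavior:

```python
# Converting tz-aware data to an ndarray goes through UTC and drops the
# timezone (and any DST fold information).
import pandas as pd

dti = pd.DatetimeIndex(['2000-01-01 00:00'], tz='US/Eastern')

dti.values
# array(['2000-01-01T05:00:00.000000000'], dtype='datetime64[ns]')
# -> UTC-based datetime64[ns]; the original tz is no longer recoverable from .values
```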

I think my preference is for

jorisvandenbossche commented 6 years ago

We already have another issue for that no?

Personally, in the interest of getting the split PRs like the PeriodArray PR merged rather quickly, I would leave the discussion on this out of those PRs (or at least the decision; we can of course already discuss). In other words, not see it as a blocker for those PRs. Keeping it a separate discussion is possible because internally we should just never use .values, but ._values, which is always the EA if possible. And the public .values can then always easily be changed to either ndarray or EA later on, once we have made a decision.

jbrockmendel commented 5 years ago

Jotting down some thoughts on how to move forward on implementing DatetimeArray/TimedeltaArray (DTA/TDA). In particular thinking about reasonably-scoped steps.

1) Nail down constructors. De-duplicate. Handle all the relevant cases so we can add DTA/TDA to the parametrized box in arithmetic tests.
2) "eventually" things: ABC classes, __repr__, tm.assert_foo_equals
3) offsets arithmetic with DTA/TDA (reasonably sized, ready-made tests, and tightly-contained scope)
4) Arithmetic methods for TDA: __mul__, __div__, __divmod__, ...; unary methods including __neg__ (see the sketch after this list)
5) Array-like methods that turn out to be needed: is_monotonic, is_monotonic_increasing, is_unique, searchsorted, ravel. These are things I've found are needed in the do-it-all-at-once branch; I'll document+test exactly why they are each needed when we get there.
6) EA Interface methods. This is the point where we need to consider doing the rest all-at-once (or more specifically the point where I expect @jorisvandenbossche will voice a strong, well-thought-out opinion)
7) Change from Inheritance to Composition.
7b) Change _values etc and the ensuing deluge of Series and internals behavior. Change tests where expected output is affected. Document.
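
For step 4, a very rough standalone sketch of the kind of methods in question (not the pandas implementation; timedelta64 data supports these ops directly on the underlying m8[ns] ndarray):

```python
# Standalone sketch only -- not the pandas TimedeltaArray.
import numpy as np


class TimedeltaArraySketch:
    def __init__(self, values):
        self._data = np.asarray(values, dtype="m8[ns]")

    def __mul__(self, other):
        # timedelta64 * numeric scalar is elementwise on the m8[ns] data
        if isinstance(other, (int, float, np.integer, np.floating)):
            return type(self)(self._data * other)
        return NotImplemented

    def __neg__(self):
        return type(self)(-self._data)
```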

The biggest stumbling point ATM in my all-at-once branch is in internals where dimension checks are failing. Hopefully I'll figure that out by the time we reach 7.

jorisvandenbossche commented 5 years ago

(didn't look at any other issue or PR that has been done on this in the last days, so possibly missing some context)

Please, please can you check to what extent it is possible to do step 7 first? (or as a first step, combined with other things if needed). E.g. for steps 3 and 4, I don't understand why they are necessarily related to this process? (arithmetic is already working, and should keep working on array/index the same as it does now, and if there are still bugs, they don't need to be mixed into this process?)

It might well be that Timedelta/DatetimeArray is more complicated than the PeriodArray (or that the PeriodArray PR was less complicated exactly because the arithmetic shared with the datetime-likes was not yet split off from the index), and that it is more difficult here to do a similar PR as was done for PeriodArray. But I really think it would be good to have the current step 7 as early as possible in the process.

The biggest stumbling point ATM in my all-at-once branch is in internals where dimension checks are failing.

Can you clarify this a bit more concretely?

Array-like methods that turn out to be needed ... I'll document+test exactly why they are each needed when we get there.

Can you already be a bit more specific here? (so we can discuss it here)

You're "all in one" branch is https://github.com/pandas-dev/pandas/pull/23415 ?

BTW, thanks for working on this!

jbrockmendel commented 5 years ago

Please please can you check to what extent it is possible to do step 7 first ? (or in a first step, combined with other things if needed).

I have a (local-only ATM) branch that is attempting to do this. In this branch I add a _eadata attribute to the Index classes and dispatch all the relevant methods/properties to that; this lets me keep _data as-is and at-least-in-theory leave values/_values etc (and so external behavior) unchanged. Large sections of the tests are now passing and many of the failures look straightforward to troubleshoot, but I'm getting segfaults all over the groupby, resample, and window tests. Since this necessarily implicates the cython code, this indicates problems outside of the DTA/TDA code that need to be troubleshot. I am still working on this branch, but do not think it should be the main line of attack.

(arithmetic is already working, and should keep working on array/index the same as it is now, and if there are still bugs, they don't need to mixed in this process?)

We aren't really testing arithmetic on the array classes themselves. That's why I want to get the constructors to the point where we can extend the existing tests to include them. If we did, we would find that a) offsets' apply_index methods don't work with the Array classes, b) TimedeltaArray does not have __mul__, __div__, etc, (and also __neg__ that it needs indirectly).

The biggest stumbling point ATM in my all-at-once branch is in internals where dimension checks are failing. Can you clarify this a bit more concretely?

Lots of errors resembling:

```
self = BlockManager
Items: Index([u'A', u'B'], dtype='object')
Axis 1: DatetimeIndex(...       dtype='datetime64[ns]', freq='B')
DatetimeBlock: 2 dtype: datetime64[ns]
blocks = [DatetimeBlock: 2 dtype: datetime64[ns]]
axes = [Index([u'A', u'B'], dtype='object'), DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
       ...   '2000-02-10', '2000-02-11'],
              dtype='datetime64[ns]', freq='B')]
do_integrity_check = True

    def __init__(self, blocks, axes, do_integrity_check=True):
        self.axes = [ensure_index(ax) for ax in axes]
        self.blocks = tuple(blocks)

        for block in blocks:
            if block.is_sparse:
                if len(block.mgr_locs) != 1:
                    raise AssertionError("Sparse block refers to multiple "
                                         "items")
            else:
                if self.ndim != block.ndim:
                    raise AssertionError(
                        'Number of Block dimensions ({block}) must equal '
                        'number of axes ({self})'.format(block=block.ndim,
>                                                        self=self.ndim))
E                   AssertionError: Number of Block dimensions (1) must equal number of axes (2)

pandas/core/internals/managers.py:118: AssertionError
```

Array-like methods that turn out to be needed ... I'll document+test exactly why they are each needed when we get there. Can you already be a bit more specific here? (so we can discuss it here)

When we call frequencies.infer_freq (indirectly from within __new__ ATM) it does lookups on is_unique, is_monotonic_increasing, is_monotonic_decreasing. Off the top of my head I don't recall where it became needed, but I had to implement __array__, and I think at one point searchsorted, though it looks like that has been deleted so maybe it was made unnecessary.

You're "all in one" branch is #23415 ?

No, the all-at-once branch is also local.

jorisvandenbossche commented 5 years ago

The biggest stumbling point ATM in my all-at-once branch is in internals where dimension checks are failing.

Can you clarify this a bit more concretely?

Lots of errors resembling:

You will need to change the DatetimeBlock to be a non-consolidatable block, or eventually an ExtensionBlock. Did you already do that? I was wondering: this might be something we could do first in a separate step / PR?

When we call frequencies.infer_freq (indirectly from within __new__ ATM) it does lookups on is_unique, is_monotonic_increasing, is_monotonic_decreasing. Off the top of my head I don't recall where it became needed

OK, I see, it is indeed used in _FrequencyInferer. I think this class can be assumed to (after the refactor) only get Datetime/TimedeltaArray, and we can always start with using a private attribute there if we want to leave the discussion of whether those Arrays should have an is_unique or is_monotonic_increasing attribute out of the refactor discussion, and rather keep it as a follow-up issue.

We aren't really testing arithmetic on the array classes themselves. That's why I want to get the constructors to the point where we can extend the existing tests to include them.

We should indeed test those, and it is good you found bugs that way. The only thing I tried to say is that IMO this is rather independent from the actual refactor. It can perfectly well be done after the initial split PR? If so, I would first focus on that.

TomAugspurger commented 5 years ago

Catching up on this now after a few days away. Is this the recommended place to start?

Feel free to take over https://github.com/pandas-dev/pandas/projects/4

We aren't really testing arithmetic on the array classes themselves.

Don't we inherit tests for those from BaseArithmeticTests when we implement the EA interface? Or are you saying we don't test them currently?

What are your plans for blocks right now? Will those all become non-consolidatable / removed?

jbrockmendel commented 5 years ago

Catching up on this now after a few days away. Is this the recommended place to start?

Yes. https://github.com/pandas-dev/pandas/issues/23185#issuecomment-435592816 a few comments up is pretty up-to-date. #23493 and #23502 are the other active strands of work here.

We aren't really testing arithmetic on the array classes themselves.

Don't we inherit tests for those from BaseArithmeticTests when we implement the EA interface? Or are you saying we don't test them currently?

I'm referring to tests/arithmetic/ where we parametrize over box=[DataFrame, Series, Index] and not yet box=[DataFrame, Series, Index, Array]. The arithmetic methods are not currently tested on the array classes themselves. #23502 starts the process of fixing that. #23493 will make it much easier to extend that to cover more of the tests (b/c ATM many of the tests call constructor([scalar1, scalar2, ...]), which will be supported on DTA/TDA after #23493).
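
A simplified sketch of what that parametrization looks like (illustrative only; the real fixtures and helpers live under pandas/tests/arithmetic/):

```python
# Simplified, illustrative sketch of box-parametrized arithmetic tests.
import pytest
import pandas as pd


@pytest.fixture(params=[pd.Index, pd.Series], ids=lambda cls: cls.__name__)
def box(request):
    # the goal discussed above is to add DatetimeArray/TimedeltaArray here
    return request.param


def test_td64_mul_int(box):
    tdi = pd.timedelta_range("1 Day", periods=3)
    expected = pd.timedelta_range("2 Days", periods=3, freq="2D")

    result = box(tdi) * 2
    assert list(result) == list(box(expected))
```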

What are you plans for blocks right now? Will those all become non-consolidatable / removed?

Main thought is to put that off as long as possible. Implement as much as I can without changing the external behavior to isolate decision-making. Given that the branch (just pushed https://github.com/jbrockmendel/pandas/tree/compose) that tries this is riddled with segfaults, this suggests the presence of other significant bugs that should be fixed before expanding the scope.

Secondary thought is that if we supported 2-D EAs, we could avoid a lot of the special-casing in blocks. Very happy to leave that discussion to a later date.

jorisvandenbossche commented 5 years ago

Given that the branch (just pushed https://github.com/jbrockmendel/pandas/tree/compose) that tries this is riddled with segfaults

I tried out your branch, and I don't get segfaults, so not sure what could be going on here (didn't run the full test suite, but I ran the tests of test_groupby.py after removing your skips for segfaults). Does it segfault on travis?

What are you plans for blocks right now? Will those all become non-consolidatable / removed?

Main thought is to put that off as long as possible.

I think this is actually something that could be done rather independently, or even before doing the actual "split" refactor, because we can make the current DatetimeBlock non-consolidatable while keeping the current M8[ns] storage. (Not fully sure whether it can be done after the split without needing to modify the current DatetimeBlock. Might be possible as well, but I would need to take a closer look to know. But that would not fully exercise the EA internally then.)

jbrockmendel commented 5 years ago

I tried out your branch, and I don't get segfaults, so not sure what could be going on here

This is interesting. What versions/system were you on? Most of my testing was on ubuntu18.04/py2.7 (quick check, still get them in 3.6 on the same machine)

I didn't run it on Travis, but on CircleCI I got a segfault-adjacent error (FWIW in previous posts where I have referenced segfaults some of them are literally "segmentation fault", for others I am using it as a catch-all for related errors):

```
pandas/tests/test_resample.py ..F.....*** Error in `/opt/conda/envs/pandas/bin/python': malloc(): smallbin double linked list corrupted: 0x0000555749983e90 ***
```

Just now I locally got (still ubuntu18+py27):

```
python -m pytest pandas/tests/test*.py --skip-slow
[...]
pandas/tests/test_resample.py ..F..............corrupted size vs. prev_size
Aborted (core dumped)
```

As for the rest of your comment, if there are steps that can be done in parallel, I encourage you to give it a go.

jorisvandenbossche commented 5 years ago

OK, if I do the above like `python -m pytest pandas/tests/test*.py --skip-slow`, I also get the segfaults...

So it seems to be triggered by running multiple tests, because if you only run the segfaulting tests themselves, they don't segfault (or at least not the ones I tried). I also didn't succeed in reproducing a segfault with a standalone example outside of the tests (e.g. converting one of the segfaulting tests into a script, then it does not segfault).

jbrockmendel commented 5 years ago

So it seems to be triggered by running multiple tests, because if you only run the segfaulting tests themselves, they don't segfault (or at least not the ones I tried).

The main things I remember noticing when trying to track this down were a) the "exec" usage + redirection in core.groupby makes troubleshooting a PITA and b) things were being mutated that shouldn't be, in particular a DatetimeIndex. Some of the segfault-like errors had messages about "double free" and "memory corruption", so I assume those are problems in _libs.

General thought is to try to make progress on the arrays as standalone classes and then revisit this fresh. That will also give me a chance to make another pass through the cython code which may turn something up.

TomAugspurger commented 5 years ago

In case it helps with the segfaults on your compose branch @jbrockmendel, pytest-faulthandler can dump a traceback:

```
pandas/tests/test_resample.py::TestResampleAPI::test_groupby_resample_on_api PASSED                                                                  [ 19%]
pandas/tests/test_resample.py::TestResampleAPI::test_pipe PASSED                                                                                     [ 23%]
pandas/tests/test_resample.py::TestResampleAPI::test_getitem PASSED                                                                                  [ 28%]
pandas/tests/test_resample.py::TestResampleAPI::test_select_bad_cols PASSED                                                                          [ 33%]
pandas/tests/test_resample.py::TestResampleAPI::test_attribute_access PASSED                                                                         [ 38%]
pandas/tests/test_resample.py::TestResampleAPI::test_api_compat_before_use PASSED                                                                    [ 42%]
pandas/tests/test_resample.py::TestResampleAPI::tests_skip_nuisance PASSED                                                                           [ 47%]
pandas/tests/test_resample.py::TestResampleAPI::test_downsample_but_actually_upsampling PASSED                                                       [ 52%]
pandas/tests/test_resample.py::TestResampleAPI::test_combined_up_downsampling_of_irregular PASSED                                                    [ 57%]
pandas/tests/test_resample.py::TestResampleAPI::test_transform PASSED                                                                                [ 61%]
pandas/tests/test_resample.py::TestResampleAPI::test_fillna PASSED                                                                                   [ 66%]
pandas/tests/test_resample.py::TestResampleAPI::test_apply_without_aggregation Fatal Python error: Segmentation fault

Current thread 0x00007fff90af5380 (most recent call first):
  File "/Users/taugspurger/sandbox/pandas/pandas/core/arrays/datetimes.py", line 238 in __new__
  File "/Users/taugspurger/sandbox/pandas/pandas/core/indexes/datetimes.py", line 368 in _simple_new
  File "/Users/taugspurger/sandbox/pandas/pandas/core/indexes/datetimes.py", line 333 in __new__
  File "/Users/taugspurger/sandbox/pandas/pandas/core/indexes/datetimelike.py", line 279 in __getitem__
  File "/Users/taugspurger/sandbox/pandas/pandas/core/internals/managers.py", line 1517 in get_slice
  File "/Users/taugspurger/sandbox/pandas/pandas/core/series.py", line 863 in _get_values
  File "/Users/taugspurger/sandbox/pandas/pandas/core/groupby/ops.py", line 850 in _chop
  File "/Users/taugspurger/sandbox/pandas/pandas/core/groupby/ops.py", line 835 in __iter__
  File "/Users/taugspurger/sandbox/pandas/pandas/core/groupby/ops.py", line 185 in apply
  File "/Users/taugspurger/sandbox/pandas/pandas/core/groupby/groupby.py", line 697 in _python_apply_general
  File "/Users/taugspurger/sandbox/pandas/pandas/core/groupby/groupby.py", line 679 in apply
  File "/Users/taugspurger/sandbox/pandas/pandas/core/groupby/generic.py", line 742 in apply
  File "/Users/taugspurger/sandbox/pandas/pandas/tests/test_resample.py", line 261 in test_apply_without_aggregation
```

https://github.com/pytest-dev/pytest-faulthandler

TomAugspurger commented 5 years ago

A message printed to stdout that pytest suppressed (pass -s)

```
Python(59478,0x7fff90af5380) malloc: *** error for object 0x7fe458f5c4e8: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
Fatal Python error: Aborted

Current thread 0x00007fff90af5380 (most recent call first):
  File "/Users/taugspurger/sandbox/pandas/pandas/core/arrays/datetimes.py", line 239 in __new__
  File "/Users/taugspurger/sandbox/pandas/pandas/core/indexes/datetimes.py", line 368 in _simple_new
  File "/Users/taugspurger/sandbox/pandas/pandas/core/indexes/datetimes.py", line 333 in __new__
  File "/Users/taugspurger/sandbox/pandas/pandas/core/indexes/datetimelike.py", line 279 in __getitem__
  File "/Users/taugspurger/sandbox/pandas/pandas/core/internals/managers.py", line 1517 in get_slice
```

the actual array passed there is

```
['2005-01-05T06:20:00.000000000' '2005-01-05T06:21:00.000000000'
 '2005-01-05T06:22:00.000000000' '2005-01-05T06:23:00.000000000'
 '2005-01-05T06:24:00.000000000' '2005-01-05T06:25:00.000000000'
 '2005-01-05T06:26:00.000000000' '2005-01-05T06:27:00.000000000'
 '2005-01-05T06:28:00.000000000' '2005-01-05T06:29:00.000000000'
 '2005-01-05T06:30:00.000000000' '2005-01-05T06:31:00.000000000'
 '2005-01-05T06:32:00.000000000' '2005-01-05T06:33:00.000000000'
 '2005-01-05T06:34:00.000000000' '2005-01-05T06:35:00.000000000'
 '2005-01-05T06:36:00.000000000' '2005-01-05T06:37:00.000000000'
 '2005-01-05T06:38:00.000000000' '2005-01-05T06:39:00.000000000']
```

which seems fine.

jbrockmendel commented 5 years ago

Good to know about pytest-faulthandler, thanks. I'll get back to working on that branch later this week.

jbrockmendel commented 5 years ago

Post-call update on game-plan here:

The scopes for these are chosen to be close to orthogonal so we can push on them in parallel. Once the first three of these are through, the next steps I have in mind are:

Once DTA/TDA are working+tested stand-alone entities, then we can worry about what names to give which constructors, as long as it doesn't interfere with...

TomAugspurger commented 5 years ago

Thanks for the update. I'll take a look at those tonight.

I'm planning to experiment with internals/blocks.py a bit, to see how much of that can be simplified beforehand.

jorisvandenbossche commented 5 years ago

Yes, thanks for the update. Do you still have segfaults in #23415 ?

implement the rest of the EA interface+tests

If this is about having the base EA tests: from my experience, it helps to do this from the beginning (when starting to subclass from ExtensionArray), to check you have everything for the EA interface. At the end of the call we talked briefly about whether this could already be done as a separate PR (without the actual inheritance -> composition split), but probably that will break some things, so this probably belongs in #23415.

jbrockmendel commented 5 years ago

Do you still have segfaults in #23415 ?

There were never segfaults in that branch; https://github.com/jbrockmendel/pandas/tree/compose is the branch with the segfaults. Nothing there has changed for a few days.*

If this is about having the base EA tests

I am far more interested in the parametrized tests in tests/arithmetic.

* Only tangentially related: I was thinking it would be nice to have a GH tag for segfaults (or other non-recoverable errors). There aren't that many of them, but they merit extra attention. Thoughts?

jorisvandenbossche commented 5 years ago

Do you still have segfaults in #23415 ?

There were never segfaults in that branch

Ah yes, sorry, was confusing them. What is the main difference between both branches? #23415 is not yet doing the inheritance/composition switch? (but the plan is to do it in that PR at some point?)

If this is about having the base EA tests

I am far more interested in the parametrized tests in tests/arithmetic.

Yes, I know :-) But we still want those EA base tests once they are actual EAs. Maybe this can be done separately in advance, but not sure (can maybe take a look at that). In any case, IMO we should not do it after making them actual EAs.

I was thinking it would be nice to have a GH Tag for segfaults (or other non-recoverable errors). There aren't that many of them, but they merit extra attention. Thoughs?

Are there currently open issues with reported segfaults? In case so, yes that sounds good.

TomAugspurger commented 5 years ago

@jbrockmendel any open PRs from https://github.com/pandas-dev/pandas/pulls/jbrockmendel that could use review? (https://github.com/pandas-dev/pandas/pull/23415 isn't quite ready, right?)

Any relevant pieces / issues that can be offloaded?

I'm going to mess around with DatetimeBlock shortly, to see what pieces can be simplified there.

jbrockmendel commented 5 years ago

any open PRs [...] that could use review?

Any relevant pieces / issues that can be offloaded?

Not at the moment. Two coming up in the next hour or so: one extending timedelta64 arithmetic tests to TimedeltaArray, another implementing most of the rest of the EA interface for DTA/TDA. The latter doesn't yet have EA tests in place, which can absolutely be offloaded.

The other one coming up a little later is implementing DatetimeArray._from_sequence (or more specifically, the datetime64 analogue of #23539)

TomAugspurger commented 5 years ago

For those curious about what step 7 (inheritance to composition) might look like, see https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:disown-datetimelike. Right now, import pandas works, but most everything beyond that is broken.

e16dd61 https://github.com/pandas-dev/pandas/commit/e16dd6140716c56d46980a527a8fd27125696bf1 is +50 net LOC, and has everything for getting "import pandas" working, but doesn't change the actual data for the various Index classes. https://github.com/pandas-dev/pandas/commit/3af610928005d84f8e038faf83bda95c9c4f9c4f starts on that, but isn't working yet.

@jbrockmendel thoughts on where we should proceed for the next few steps? I'd like to get us over to composition sooner rather than later. I think that'll help me better judge your step 5 (extra EA methods that need to be implemented).

There are already merge conflicts on that branch since this morning, so I don't plan to work on it further unless we have consensus that it's time to make the switch.
