pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.58k stars 17.9k forks source link

Proposal for future copy / view semantics in indexing operations #36195

Closed TomAugspurger closed 11 months ago

TomAugspurger commented 4 years ago

Update: concrete proposal for Copy-on-Write: https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit?usp=sharing


[original post]

Hi all,

During the recent sprint we discussed copy / view semantics in indexing operations. I've tried to summarize a proposal that came out of it.

The tldr is

  1. Indexing operations always return a view where possible (always when indexing columns, if possible when indexing the rows).
  2. For consistent behavior, we implement either Copy on Write or Error on Write (explained in the proposal)
  3. For readability / understandability, we give users the tools to control in code whether mutations to a "child" dataframe (result of an indexing operation) propagate to the parent, while avoiding unnecessary "defensive" copies.

The full proposal is available at https://docs.google.com/document/d/1csGE4qigPR2vzmU2--jwURn3sK5k5eVewinxd8OUPk0/edit. I'd like to collect comments there and then propose properly by adding it to the roadmap.

cc @pandas-dev/pandas-core

jbrockmendel commented 4 years ago

asv from a branch that changes column-based indexing to always return a view:

       before           after         ratio
     [42289d0d]       [f99abd57]
     <master>         <myway>   
-     9.24±0.06ms      8.77±0.02ms     0.95  algorithms.Hashing.time_series_string
-        57.3±2ms       54.3±0.2ms     0.95  join_merge.MergeAsof.time_by_int('backward', 5)
-      76.1±0.3ms       72.0±0.3ms     0.95  frame_methods.Describe.time_dataframe_describe
-        57.3±1ms       54.0±0.2ms     0.94  join_merge.MergeAsof.time_by_int('backward', None)
-     14.0±0.07ms      13.2±0.08ms     0.94  groupby.MultiColumn.time_cython_sum
-        1.29±0ms      1.20±0.01ms     0.94  timeseries.AsOf.time_asof_nan_single('DataFrame')
-     1.50±0.01ms      1.38±0.01ms     0.92  timeseries.AsOf.time_asof_single('DataFrame')
-     3.62±0.01ms      3.31±0.01ms     0.91  groupby.CountMultiDtype.time_multi_count
-      67.0±0.4ms       60.7±0.4ms     0.91  reshape.Crosstab.time_crosstab_normalize_margins
-      14.8±0.4ms      13.2±0.08ms     0.89  join_merge.MergeAsof.time_on_int32('nearest', 5)
-     14.4±0.09ms      12.8±0.04ms     0.89  join_merge.MergeAsof.time_on_int32('nearest', None)
-      14.7±0.1ms      13.0±0.06ms     0.89  join_merge.MergeAsof.time_on_uint64('nearest', 5)
-     14.3±0.05ms      12.6±0.03ms     0.88  join_merge.MergeAsof.time_on_uint64('nearest', None)
-      69.8±0.5ms       61.4±0.7ms     0.88  reshape.PivotTable.time_pivot_table_margins
-     14.3±0.06ms       12.6±0.1ms     0.88  join_merge.MergeAsof.time_on_int('nearest', 5)
-     12.7±0.05ms      11.1±0.04ms     0.87  join_merge.MergeAsof.time_on_int32('forward', 5)
-     14.0±0.06ms      12.3±0.05ms     0.87  join_merge.MergeAsof.time_on_int('nearest', None)
-     12.5±0.05ms      10.9±0.05ms     0.87  join_merge.MergeAsof.time_on_int32('forward', None)
-     12.1±0.05ms      10.5±0.02ms     0.87  join_merge.MergeAsof.time_on_int32('backward', 5)
-     12.0±0.05ms      10.3±0.04ms     0.86  join_merge.MergeAsof.time_on_int32('backward', None)
-     12.7±0.09ms      10.9±0.04ms     0.86  join_merge.MergeAsof.time_on_uint64('forward', 5)
-     12.2±0.06ms      10.5±0.07ms     0.85  join_merge.MergeAsof.time_on_int('forward', 5)
-     12.4±0.03ms      10.6±0.05ms     0.85  join_merge.MergeAsof.time_on_uint64('forward', None)
-     12.0±0.06ms      10.2±0.04ms     0.85  join_merge.MergeAsof.time_on_int('forward', None)
-     11.9±0.05ms      10.1±0.05ms     0.85  join_merge.MergeAsof.time_on_uint64('backward', 5)
-     11.8±0.04ms      9.98±0.05ms     0.85  join_merge.MergeAsof.time_on_uint64('backward', None)
-      11.4±0.4ms      9.61±0.02ms     0.84  join_merge.MergeAsof.time_on_int('backward', None)
-     11.5±0.06ms      9.71±0.04ms     0.84  join_merge.MergeAsof.time_on_int('backward', 5)
-         407±3μs          291±2μs     0.72  reindex.Reindex.time_reindex_columns
-      3.99±0.1ms       2.72±0.2ms     0.68  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('ge')
-     3.56±0.02ms      2.36±0.03ms     0.66  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('ne')
-      3.60±0.2ms      2.35±0.02ms     0.65  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('lt')
-     47.8±0.08ms       28.4±0.1ms     0.59  frame_methods.Dropna.time_dropna_axis_mixed_dtypes('all', 1)
-      4.81±0.9ms      2.81±0.03ms     0.58  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('truediv')
-     4.13±0.04ms       2.41±0.2ms     0.58  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('eq')
-        11.9±1ms       6.88±0.5ms     0.58  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('pow')
-     4.12±0.04ms      2.34±0.05ms     0.57  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('le')
-        4.67±1ms       2.55±0.2ms     0.55  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('mul')
-        4.90±1ms       2.55±0.2ms     0.52  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('sub')
-      40.3±0.2ms       20.5±0.2ms     0.51  frame_methods.Dropna.time_dropna('all', 1)

Methodological note: i did a full asv run, then for everything that came back as changed re-ran those and changes that were not reproducible (e.g. tslibs benchmarks which have to be noise)

TomAugspurger commented 4 years ago

I'd like to move forward with this, including a version of the proposal on https://pandas.pydata.org/pandas-docs/dev/development/roadmap.html.

There hasn't been much activity on the google doc, https://docs.google.com/document/d/1csGE4qigPR2vzmU2--jwURn3sK5k5eVewinxd8OUPk0/edit. The biggest outstanding item is a decision / discussion of "Copy on Write" vs. "Error on Write". It'd be nice if the proposal was explicit on this point, but I worry we won't be able to decide without actually trying it out.

jorisvandenbossche commented 4 years ago

Some assorted thoughts I gathered the last weeks (not necessarily all needing an answer in an initial discussion / proposal, though):

Dr-Irv commented 4 years ago

I think the crucial question that deserves more discussion is: are we fine with ditching the SettingWithCopyWarning for those cases it was introduced for? Because there is a reason that it was introduced: to warn uses that were doing a setitem operation that might have no effect silently, and thus be confusing / easy to miss for users.

I think the classical example is something like df[df['B'] > 4]['B'] = 10 - a chained indexing assignment that doesn't work. The warning was introduced to prevent people doing that and silently having no effect. Are we fine with it that this will indeed have silently no effect? (which would be the consequence of copy-on-write)

For me, this is one of the compelling arguments to go with "Error on Write". Then we are forcing people to be explicit about their intent, instead of depending on a behavior that is inconsistent (today) and for which we might not figure out all of the boundary cases if we chose "copy-on-write". Also, I've seen the following behavior with people I've worked with: 1) Ignoring the SettingWithCopyWarning, passing that code on to someone else who then has to deal with it 2) Writing a function that works perfectly fine in test scenarios with data read from a CSV file, but then handing off that function to someone else to use, who passes a different DataFrame that was constructed via some filtering operation, which then causes the warning to be issued. 3) Writing some code that is then passed to a totally different team, which treats all warnings as errors, in which case the code has to be fixed

If we silently do copy-on-write, I'm concerned that these types of issues wouldn't get found by people using pandas.

TomAugspurger commented 4 years ago

Agreed that the classic chained indexing example is a compelling argument for Error on Write rather than Copy on Write.

What to do with duplicate column selection? (df[['a', 'a']]) Should that also still be a view? (if we would go for views) Mutating one column would then also mutate the other.

IMO, yes that should create two views to the same data.

What kind of mutation operations do we consider in this discussion? We clearly talked about "changing values in-place (setitem)", but there are also other kinds of mutation like changing metadata, flags, ..

I'd like to keep this focused just on setting data / values. I'm hopeful that the metadata / flags issue can be solved by doing shallow copies by default and providing an option to copy the data if desired, like we did for set_flags.

What do we do with accessing a single column as a Series?

If possible, treating it the same as a DataFrame seems best to me.

jreback commented 4 years ago

is there any system that does error-on-write? I do prefer this, but am a little concerned that we are treading new group vs. copy-on-write (which I think has a large body of systems / literature).

jorisvandenbossche commented 4 years ago

Putting it here because most of the discussion on this topic is still here (but can also move to https://github.com/pandas-dev/pandas/pull/36988).

While the classic chained indexing example is indeed an argument for Error-on-Write (for this case, it's basically what we have now but with an error instead of warning), I think there are also arguments/examples to give in favor of Copy-on-Write:

Also, I've seen the following behavior with people I've worked with: ...

In this specific case, having consistent Copy-on-Write could also "solve" it? If is provides consistent behaviour (always copy, never mutate the passed data), not depending on the actual data passed to the function. (of course, it depends on a lot on the actual case, and you were only giving a generic flow. But making things consistent in itself (always copy-on-write or always error-on-write) could also help avoiding such issues, I think? (while now you can actually mutate or not, when you get (and ignore) warning)

jorisvandenbossche commented 4 years ago

Assume that we could actually detect chained indexing? In this hypothetical situation, would people then be more in favor of Copy-on-Write over Error-on-Write?

The assumption would be that we can detect the difference in __setitem__ between

df[df['B'] > 4]['B'] = 10

vs

temp = df[df['B'] > 4]
temp['B'] = 10

I am wondering if we shouldn't be able to actually detect this based on reference counting. In the first case, there is no additional reference to df[df['B'] > 4] than the calling dataframe (self) of the second __setitem__ operation. (I am not sure how robust such a detection would be though)

Dr-Irv commented 4 years ago

In this specific case, having consistent Copy-on-Write could also "solve" it? If is provides consistent behaviour (always copy, never mutate the passed data), not depending on the actual data passed to the function. (of course, it depends on a lot on the actual case, and you were only giving a generic flow. But making things consistent in itself (always copy-on-write or always error-on-write) could also help avoiding such issues, I think? (while now you can actually mutate or not, when you get (and ignore) warning)

It's possible a consistent copy-on-write would solve this issue, but we have to make sure we get all of the cases right, and I'm just not convinced we'll be able to think all of those cases through. The other issue for me is that copy-on-write silently does something, which might have side effects that a user would like to be made aware of, whereas error-on-write makes the user very aware of what he is doing.

jorisvandenbossche commented 4 years ago

It's possible a consistent copy-on-write would solve this issue, but we have to make sure we get all of the cases right, and I'm just not convinced we'll be able to think all of those cases through.

Isn't the same true for Error-on-Write? We also need to detect all cases where we need to error instead of allow the mutation.

jreback commented 4 years ago

@jorisvandenbossche you have just described the current implementation of SettingCopyWarning and it's entire purpose

it is tracking refs

but i don't actually think this is that robust - and thats the same reason i am not enthusiastic about copy on write

yes if we can guarantee that we can always do it great

but that's the problem i am not sure we really can

though maybe now that we have full control over PandasArrays it might be fully possible (way back when when i did SettingWithCopy) some ops just indexes directly to the numpy arrays and we had to be very defensive about this

today we have much more robust handling

Dr-Irv commented 4 years ago

It's possible a consistent copy-on-write would solve this issue, but we have to make sure we get all of the cases right, and I'm just not convinced we'll be able to think all of those cases through.

Isn't the same true for Error-on-Write? We also need to detect all cases where we need to error instead of allow the mutation.

Yes, I agree. But if we miss a case, we fix it by raising an exception. With copy-on-write, we fix it by doing the copy, but I'm guessing that there might be cases where we say "Given this case, we should have chosen error-on-write".

My sense is that it is easier to go from the current state of SettingWithCopyWarning to error-on-write and then to copy-on-write, than to go from SettingWithCopyWarning to copy-on-write and then discover we have to do error-on-write because we simply missed something. At least as a first step, we can change all the SettingWithCopyWarning stuff to just raise an exception and we have a first implementation. (That's my sense - I could be totally wrong)

TomAugspurger commented 4 years ago

Just to clarify the choice, is it fair to say that the only real choice is between

  1. Error on Write
  2. Copy on Write with SettingWithCopyWarning for chained indexing?

The initial proposal (and my PR) left out the warning specifically for chained indexing part. If we can reliably detect chained indexing (a big if?) then do people prefer option 2?

jorisvandenbossche commented 4 years ago

@jorisvandenbossche you have just described the current implementation of SettingCopyWarning and it's entire purpose

@jreback It might be achieved using the same mechanism (I would need to look more into the code of this, it's not a part I am very familiar with), but at least the goal I described is different: the current SettingWithCopyWarning warns in both example cases, while I want only the first (actual chained indexing) to raise, and not the second (assigning the subset to a variable).

Yes, I agree. But if we miss a case, we fix it by raising an exception. With copy-on-write, we fix it by doing the copy, but I'm guessing that there might be cases where we say "Given this case, we should have chosen error-on-write".

@Dr-Irv I don't fully understand your last sentence.
It's true that, in case of a bug in the detection code when to copy/error, the fix is more explicit with error-on-write ("loud", since people will start seeing an error if they relied on wrong behaviour, instead of simply no longer getting the wrong behaviour). So do you mean that even if we choose copy-on-write, we need to fix cases by raising an error instead of copying because it would otherwise be a "silent" fix?

jorisvandenbossche commented 4 years ago

Just to clarify the choice, is it fair to say that the only real choice is between

Not sure if there is anybody in favor of it, but the original write-up also still has the option of "clear rules" (eg indexing columns could always be actual views)

jbrockmendel commented 4 years ago

Not sure if there is anybody in favor of it, but the original write-up also still has the option of "clear rules" (eg indexing columns could always be actual views)

I prefer this over the status quo.

Dr-Irv commented 4 years ago

@Dr-Irv I don't fully understand your last sentence. It's true that, in case of a bug in the detection code when to copy/error, the fix is more explicit with error-on-write ("loud", since people will start seeing an error if they relied on wrong behaviour, instead of simply no longer getting the wrong behaviour). So do you mean that even if we choose copy-on-write, we need to fix cases by raising an error instead of copying because it would otherwise be a "silent" fix?

No, I mean let's say that we choose copy-on-write. We think we get it right. We release the code. Someone discovers a place where we didn't get it right, and then we discuss it, and realize we should have chosen "error-on-write" in the first place. I'm just erring on the side of caution by preferring error-on-write.

jorisvandenbossche commented 4 years ago

let's say that we choose copy-on-write. We think we get it right. We release the code. Someone discovers a place where we didn't get it right, and then we discuss it, and realize we should have chosen "error-on-write" in the first place. I'm just erring on the side of caution by preferring error-on-write.

OK, understand it now ;) But, I personally think we should discuss / choose what we actually prefer. Let's assume that after more discussions we think we prefer copy-on-write over error-on-write, then I would say we should go with that, even if that makes it more difficult to still change mind afterwards. This also touches upon implementation / adoption questions which we didn't really discuss yet. I think we will need some way to opt-in to this new behaviour initially / have a flag to enable it / .. as an experimental stage in which we can still make breaking changes (like going from copy-on-write to error-on-write, which in principle could also still be done with deprecations first). This might be quite complex to first have both old and new behaviour side by side, but not sure there is a good alternative ..

(and to be clear: while I am mainly arguing here for Copy-on-Write to give some counter-arguments, I am not yet sure myself which of the two is "best". I mainly think it's something to futher explore/experiment/discuss)

Not sure if there is anybody in favor of it, but the original write-up also still has the option of "clear rules" (eg indexing columns could always be actual views)

I prefer this over the status quo.

I also prefer that over status-quo. But, @jbrockmendel, do you also prefer that over the copy-on-write/error-on-write alternatives discussed? (personally, I think that I am currently in favor of one of those two other options)
Note that also in this case of "clear rules", the discussion around what to do with the "typical chained indexing example" is still needed (eg if we have the rule that subsetting rows is always a copy, do we stop raising SettingWithCopyWarning for cases like subset = df[mask] with subsequent changes ?)

Dr-Irv commented 4 years ago

Here's something that I brought up in the October, 2020, dev meeting. Consider:

>>> df = pd.DataFrame({"x":[1,2,3], "y":[4,5,6]})
>>> df
   x  y
0  1  4
1  2  5
2  3  6
>>> sx=df['x']
>>> sx.iloc[1] = 10
>>> sx
0     1
1    10
2     3
Name: x, dtype: int64
>>> df
    x  y
0   1  4
1  10  5
2   3  6
>>> sdf = df[['x']]
>>> type(sdf)
<class 'pandas.core.frame.DataFrame'>
>>> sdf
    x
0   1
1  10
2   3
>>> sdf.loc[1,'x'] = 20
C:\Anaconda3\lib\site-packages\pandas\core\indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> sdf
    x
0   1
1  20
2   3
>>> df
    x  y
0   1  4
1  10  5
2   3  6

Right now, the behavior of picking a single column using the syntax df['x'], which creates a Series versus df[['x']] which creates a DataFrame is different, in that when you get a Series, modification of that series silently modifies the underlying DataFrame while when you get a DataFrame, you get the SettingWithCopyWarning. If we choose copy-on-write, then someone who used the df['x'] syntax with the dependence that the underlying DataFrame was modified will no longer have working code, and will have a hard time figuring out that a new version of pandas is what caused the issue.

On the other hand, if we use error-on-write, then in BOTH cases, we raise an error, which makes it clear to the user that they need to make their intent clear.

jbrockmendel commented 4 years ago

Do either the CoW or EoW allow for the possibility that when i edit a view I am doing so intentionally?

Dr-Irv commented 4 years ago

Do either the CoW or EoW allow for the possibility that when i edit a view I am doing so intentionally?

By "doing so intentionally", you mean modifying the underlying DataFrame as a result of modifying the view?

jbrockmendel commented 4 years ago

By "doing so intentionally", you mean modifying the underlying DataFrame as a result of modifying the view?

Yes. A case where I want to modify both.

Dr-Irv commented 4 years ago

By "doing so intentionally", you mean modifying the underlying DataFrame as a result of modifying the view?

Yes. A case where I want to modify both.

Yes, the proposal has a as_mutable_view() method to allow the user to indicate they want to allow mutation. Works whether we do COW or EOW.

toobaz commented 4 years ago

Not sure if there is anybody in favor of it, but the original write-up also still has the option of "clear rules" (eg indexing columns could always be actual views)

Sorry, I'm missing the difference between these "clear rules" and the "whenever possible", could you clarify?

I'm also missing one thing in the proposal: how can something like df[["A"]].as_mutable_view() be implemented? Either df[["A"]] already provides a mutable view - hence the method call is noop - or it provides a copy of data - and it will be painfully hard to keep track and edit the original data (right?).

A syntax I would understand would be something like df.v[["A"]] - where .v is lazy and only operates when indexed. But even leaving aside the syntax, are we thinking this should work for all indexing operations?! For instance df.loc[['a', 'b'], slice(1, 10, 4)]?! Not that it's impossible, but it does looks difficult to me.

TomAugspurger commented 4 years ago

I'm also missing one thing in the proposal: how can something like df[["A"]].as_mutable_view() be implemented?

Indexing columns would always give views. The difference comes when you try to mutate the view

>>> df2 = df1[["A']]  # df2 is a view on df1
>>> df2.iloc[:, :] = 0  # Copies (with Copy on Write) or raises (with Error on Write).

And for the case of "I want to mutate df1, you'd use df2 = df1[["A"].as_mutable_view().

are we thinking this should work for all indexing operations?

I think this discussion is limited to

If that sounds right, I'll clarify it in the proposal document.

toobaz commented 4 years ago

Indexing columns would always give views. The difference comes when you try to mutate the view

Yep, I had completely misunderstood, now I got it, as well as the difference between the "clear rules" and the "whenever possible": that the former would stay more or less with the status quo of creating mutable views.

If I now got it right, I still have one question: what makes us different from numpy, where the status quo is sensible and does not seem to be under discussion?

Is it because our users have different needs, or are less smart? I don't think so.

Is it because our implementation is different? The only relevant difference I see is that any indexing on columns could return views... but this is still a "clear rule", not particularly difficult to grasp.

So my feeling is that SettingWithCopyWarning was born only because our rules were not clear, and that this is a problem we can fix, and that the proposal is significantly more complicated for users to understand.

In the long run, we could remove the warning - or make it optional - and provide a way to set the CoW and/or EoW behavior. All of this should not require more code than what is being discussed here, but it would not change the default behavior. And by the way EoW would be annoying in many cases (including the internals) as a default, but it would be a great tool as an optional.

This said, I do acknowledge that this was discussed live, so I hope I'm not saying something completely obvious - and if I am, please feel free to tell me.

Coming back to the syntax, however,

And for the case of "I want to mutate df1, you'd use df2 = df1[["A"].as_mutable_view().

I think this is dangerous. as_mutable_view() is something that basically does not affect df2 (with CoW), or affects it only in the sense of not raising errors (with EoW), while it does affect df1 (because it allows its content to change). This is not intuitive.

If I currently write user_provided_method(df1[["A"]]) I know (because it is the default) that user_provided_method might change df1. But if the default becomes CoW or EoW, I might expect df1 to be safe. Instead the user can still call as_mutable_view(). So either I do user_provided_method(df1[["A"]].copy(deep=True)) (possibly wasting a copy), or we establish a hierarchy by which copy_if_needed() wins over a later as_mutable_view() (and assume that users know both, and their hierarchy). Isn't it much cleaner to have df2 = df1.mutable[["A"]] (lazy), where mutability cannot be applied retroactively (at least through the API)?

jbrockmendel commented 4 years ago

what makes us different from numpy, where the status quo is sensible and does not seem to be under discussion?

The status quo is not sensible. e.g.

arr = pd.array([1, 2, 3])   # <- IntegerArray
df = pd.DataFrame({"A": arr})
df2 = df[["A"]]

df3 = df.astype("i8")
df4 = df3[["A"]]

df2 is a view on df, while df4 is a copy of df3.

Sensible behavior would be either always-copy or always-view. I maintain

a) there isn't a compelling reason to treat df[["A"]] differently from df["A"] b) "indexing on columns always gives views" is easier to document/explain/grok than any mixed alternative c) views are better for perf.

toobaz commented 4 years ago

The status quo is not sensible. e.g.

Sorry, I was referring to the numpy status quo, presumably amended with "any indexing on columns returns views". So it would be "a (mutable) view is always returned except when rows are indexed in such a way that would return a copy in numpy [with lists/slices of labels being assimilated to lists/slices of positions]". I'm all for stating and enforcing clear rules with selected API changes, while I'm afraid that the more drastic changes to default behavior (CoW or EoW) are not what users expect.

TomAugspurger commented 4 years ago

If I now got it right, I still have one question: what makes us different from numpy, where the status quo is sensible and does not seem to be under discussion?

I think the primary difference is that DataFrames are "more columnar" than ndarrays. I would claim that

  1. Slicing a subset of the columns df[['A', 'B']] is more common than slicing a subset of the second axis of an ndarray `ndarray[[:, [0, 1]]
  2. Our data structure allows for more things to be views (at the "cost" of less consolidation). In NumPy any kind of fancy indexing / boolean masking results in a copy. We can support fancy indexing (integer or label-based) or boolean masks on the columns returning views.
toobaz commented 4 years ago

I think the primary difference is that DataFrames are "more columnar" than ndarrays.

I totally agree. To my eyes, your two points are precisely what justifies the "any indexing on columns returns views" variation - not the CoW or EoW as default.

jorisvandenbossche commented 4 years ago

Regarding the example of @Dr-Irv at https://github.com/pandas-dev/pandas/issues/36195#issuecomment-708605199:

If we choose copy-on-write, then someone who used the df['x'] syntax with the dependence that the underlying DataFrame was modified will no longer have working code, and will have a hard time figuring out that a new version of pandas is what caused the issue.

If we choose copy-on-write, and also use this for Series access, we for sure need to raise warnings about that this will no longer happen in the future / is no longer happening for a long time, I think. Because it's indeed true that this would be a tricky case to upgrade for (but with good warnings, still doable IMO). But, whetherdf['x'] will stop being a view is still something we need to discuss / decide. It's not explicitly mentioned in the proposal, and also the reason I raised this question above https://github.com/pandas-dev/pandas/issues/36195#issuecomment-702195996.

[@toobaz ] I'm also missing one thing in the proposal: how can something like df[["A"]].as_mutable_view() be implemented? are we thinking this should work for all indexing operations?

[@TomAugspurger] I think this discussion is limited to

  • Indexing just the columns
  • (maybe) Indexing the rows in a way where NumPy can return a view when slicing

To add to this: as_mutable_view() will "work" (in the sense of being available) for all indexing operations. But how it "works" is that when the subset is actually a copy already (so mutating the parent is not possible), the method will raise an error saying that it cannot give you a mutable view.
In which cases it won't raise an error, that's indeed the list that Tom provided.

[@toobaz, about df2 = df1[["A"].as_mutable_view()] I think this is dangerous. as_mutable_view() is something that basically does not affect df2 (with CoW), or affects it only in the sense of not raising errors (with EoW), while it does affect df1 (because it allows its content to change). This is not intuitive.

If I currently write user_provided_method(df1[["A"]]) I know (because it is the default) that user_provided_method might change df1. But if the default becomes CoW or EoW, I might expect df1 to be safe. Instead the user can still call as_mutable_view(). So either I do user_provided_method(df1[["A"]].copy(deep=True)) (possibly wasting a copy), or we establish a hierarchy by which copy_if_needed() wins over a later as_mutable_view() (and assume that users know both, and their hierarchy). Isn't it much cleaner to have df2 = df1.mutable[["A"]] (lazy), where mutability cannot be applied retroactively (at least through the API)?

I would say that it is the responsibility of the function to document whether it will mutate input or not, but indeed, you can of course not be sure about this (like it is the case now as well). Note that if we do CoW or EoW, we would have a copy method that only "copies if needed", to avoid the "possibly wasting a copy". This is actually also one of the main drivers (at least for me) to get away from "status quo". As currently you need to waste a copy when eg doing boolean indexing, where you know you already have a copy of the original dataframe, but need to do another copy to avoid getting warnings (but to be clear: the "clear rules" option without the current warnings would also solve that).

Having a way to get this mutable view that doesn't do this retroactively (like your df2 = df1.mutable[["A"]]) is an interesting idea (if we would opt for EoW/CoW). That might indeed give more explicit control over mutability to the "owner" of the dataframe (so, if you as a user didn't make the subset mutable, a function receiving your dataframe also won't be able to do it).

toobaz commented 4 years ago

Note that if we do CoW or EoW, we would have a copy method that only "copies if needed", to avoid the "possibly wasting a copy".

Yes, we just need to make sure/clear this cannot be overridden by a later .as_mutable_view().

jorisvandenbossche commented 4 years ago

I think it would be useful to work out a bit more what the option of "clear rules" could look like, to be able to better compare it to CoW or EoW. (because also here there are still different options, and open questions about what to do with warnings ..)


One possible set of well-defined copy/view rules for indexing:

One potential addition to the above two rules could be:

With this addition, we would basically follow numpy rules for row selection, but have our own rule (always a view) for columns, since we can do this for any type of column selection given the columnar nature of a DataFrame. But the addition also complicates the rules a bit more (at least for people not familiar with numpy's rules, but I think this is a large part of our user base).

Also in this case, the question comes up of what to do with the "classical chained indexing" example:

# raise a warning or not?
# According to our "clear rules" this will not update `df`, but it still remains a
# gotcha like it always has been (the reason we actually have this warning)
df[df['B'] > 4]['B'] = 10

# raise a warning or not? 
# For me one of the reasons for this whole discussion is that the below can be done (in some way) without warning
subset = df[df['B'] > 4]
subset['B'] = 10

A different possible set of well-defined copy/view rules for indexing:

And again there could be an additional rule for slicing rows (although here that would only apply for slicing rows while selecting a single column as a Series, then)

This is another set of "clear rules" that I think is closer to current behaviour (so you could see it as a "rationalization" of the current behaviour). BTW, returning a "copy" for selecting multiple columns doesn't necessarily mean we need to return a physical copy, as long as it appears a copy to the user, to follow those "clear rules". Although with this note, this is actually a variant of CoW ..

Also here, the same question about warning or not for df[df['B'] > 4]['B'] = 10 applies.

toobaz commented 4 years ago

With this addition, we would basically follow numpy rules for row selection

Indeed this is what I had this in mind, but I had overlooked that in numpy selecting a single row results in a view, which is not something we want, as long as in the long term we don't expect pandas to support 2D data blocks (a Series should really not have to look for its data scattered in N different 1D blocks).

So if this long term assumption holds, I would be uncertain between indexing rows

... and in the end I would have a slight preference for the former as long as the latter is very easy to do (e.g. as_mutable_view, whatever the syntax).

Also here, the same question about warning or not for df[df['B'] > 4]['B'] = 10 applies.

To my eyes, the entire discussion on chained indexing is the consequence of not having clear rules. If we do have clear rules and a handy way to choose between copies and views, we do not need to warn users.

My preferred solution is:

In the docs, we'll just warn users that chained indexing like df[df['B'] > 4]['B'] = 10 is not guaranteed to work, but if they want to make sure it works (where possible), they can do df.v[df['B'] > 4].v['B'] = 10 (unless we make .c and .v inheritable - in the spirit of the proposal - so that df.v[df['B'] > 4]['B'] = 10 is enough, but we need a third accessor to go back to "normal mode").

Dr-Irv commented 4 years ago

My preferred solution is:

  • df.loc: "clear rules"
  • df.v.loc: a view whenever possible, raise otherwise
  • df.c.loc: CoW Users who want to play it safe use df.c.loc. Users who look at performance can use df.v.loc to avoid undetected copies. Users who don't have special needs (who I think are the majority) will go on as always.

I like the concept of having explicit methods for getting views. (e.g. df.v.loc).

The issue is what to do with __getitem__(). @toobaz what rules would you have if .loc was not being used, e.g. df['A'], df[['A']], df[['A','B']], df[df['A']>3]['B'], df[df['A']>3][['B']] ?

So I'll throw out another idea, which is that we leave existing behavior as it is today, but we implement the code from @jbrockmendel that returns views of a DataFrame as a new method on DataFrame. So if you want performance, you do df.view[['A','B']] and that gives you a view, but df[['A','B']] gives a copy as it does today. Then the following sequence would give the SettingWithCopyWarning:

dfv = df.view[['A','B']]
dfv.iloc[0,'A'] = 10  # Yields warning - it's a view
jorisvandenbossche commented 4 years ago

Then the following sequence would give the SettingWithCopyWarning:

That doesn't enable the usecase of "I want a view so I can mutate inplace", though. If you explicitly ask for a view, I would expect to not get a warning? (the user knows they have a view, so should also know that mutating the subset mutates the parent?)

[@toobaz] we'll just warn users that chained indexing like df[df['B'] > 4]['B'] = 10 is not guaranteed to work, but if they want to make sure it works (where possible), they can do df.v[df['B'] > 4].v['B'] = 10

That specific example cannot actually work. A boolean mask always gives a copy, so you're mutating a column of a copy. (unless we would start with "lazy filtering", keeping track of the selection filter, without doing the actual selection until some next operation. A very interesting topic, but we probably already have enough on out plate in this discussion .. ;))

toobaz commented 4 years ago

what rules would you have if .loc was not being used,

... depends on what rules we settle on, but..

e.g. df['A']

same as df.loc[:, 'A']

df[['A']]

same as df.loc[:, ['A']]

df[['A','B']]

same as df.loc[:, ['A', 'B']]

df[df['A']>3]['B']

same as df.loc[df['A']>3].loc[:, 'B']

df[df['A']>3][['B']]

same as df.loc[df['A']>3].loc[:, ['B']]

I think these identities should be valid already and should stay valid whatever decisions we take. Also because the rules shouldn't make any reference to loc vs. _getitem_ vs. iloc... only to type of indexer and/or dimensionality of the result.

toobaz commented 4 years ago

That specific example cannot actually work

Correct! I took a bad example. My argument would hold for f['B']['x'] = 10.

df.v[df['B'] > 4].v['B'] = 10 would actually raise which is good because it informs the user it's not working as intended.

Dr-Irv commented 4 years ago

That doesn't enable the usecase of "I want a view so I can mutate inplace", though. If you explicitly ask for a view, I would expect to not get a warning? (the user knows they have a view, so should also know that mutating the subset mutates the parent?)

That's new functionality, but today, you can mutate in place if you grab a single Series. So we would still allow that kind of mutate-in-place. We just wouldn't allow the new functionality of getting a view that returns a DataFrame and mutating that view, but that doesn't work today anyway.

What I am suggesting here (just create a new method to return a view of a DataFrame when asking for multiple columns) would then allow us to preserve all the existing behavior, get the benefits of performant views of a DataFrame when you want it, and where you don't need to modify it, and might be an incremental step towards the eventual behavior where we have "rules" that determine whether a view or copy is returned using the existing operators.

jorisvandenbossche commented 3 years ago

I would like to revive this discussion.

From thinking a bit more about this the last months (and re-reading the discussion above), I think I am currently most convinced of the following concrete proposal:

In addition, we can still have some way to get an actual view (whether it is the as_mutable_view() from the original proposal, or another indexer like df.view[..] or df.v.loc[..], that's still to be seen).

Another addition: I think we can also extend this to creation methods (creating a DataFrame/Series from existing data, eg DataFrame({'col': arr_or_series}), or adding columns like df["new_col"] = arr_or_series), where the resulting DataFrame/Series would by default also never modify the original using C-o-W (and for constructors this could be overriden with an explicit copy=False).

Motivations:

toobaz commented 3 years ago

I think I am currently most convinced of the following concrete proposal:

I like it. Just one objection, which however does not undermine my support for your proposal:

  • Assignment with chained indexing will basically never work in this case, which makes that we can always warn, or actually even error (to be very explicit) for it

My understanding is that warnings/errors would still be based on some sort of counting references, the implementation of which would not be significantly simplified. But I don't see this as a big stopper - after all, in Python, [1, 2, 3, 4][2:4].append(3) doesn't emit any error or warning.

Another clarification: when you think to CoW as the default, that basically means that any structure that stores data refers to a specific pandas structure as "parent", and if a write operation by any other pandas structure which is not its parent happens, then a copy is created beforehand (with the other pandas structure set as parent), correct?

jbrockmendel commented 3 years ago

we will need to implement a form of C-o-W anyway

Assuming away complexity/perf concerns, something like this does sound like our best option. Do we have anything like a POC, or anything on the horizon?

If we aren't likely to implement CoW in the near-future, we shouldn't let the perfect be the enemy of the good.

jorisvandenbossche commented 3 years ago

Do we have anything like a POC, or anything on the horizon? If we aren't likely to implement CoW in the near-future, we shouldn't let the perfect be the enemy of the good.

I am doing some small experiments at the moment, and if there is some level of agreement on the way to go, I am planning to work on it in the coming month.

jorisvandenbossche commented 3 years ago
  • Assignment with chained indexing will basically never work in this case, which makes that we can always warn, or actually even error (to be very explicit) for it

[@toobaz] My understanding is that warnings/errors would still be based on some sort of counting references, the implementation of which would not be significantly simplified. But I don't see this as a big stopper - after all, in Python, [1, 2, 3, 4][2:4].append(3) doesn't emit any error or warning.

The implementation will indeed involve some sort of reference counting, but might not necessarily be the same as we have now. I explicitly want to distinguish between:

subset = df[df['b'] <0]
subset['a'] = 0

and

df[df['b'] <0]['a'] = 0

where only the second would raise a warning/error (while currently both warn). Based on a quick experiment some time ago, I think it is doable to detect specifically the second case (which is a case where we know that 1) the user intents to modify df while 2) df will actually never be modified).

It's true that modifying objects with chained indexing eg with Python lists also doesn't work. But since the above actually worked for a long time in pandas (if the row / column indexer is reversed), I think it would be good to explicitly warn/error for this case.

jorisvandenbossche commented 3 years ago

Gentle ping for this discussion. Are there more thoughts on my proposal for "always copy through copy-on-write" at https://github.com/pandas-dev/pandas/issues/36195/#issuecomment-786654449 ?

jbrockmendel commented 3 years ago

I am doing some small experiments at the moment, and if there is some level of agreement on the way to go, I am planning to work on it in the coming month.

Any progress to report? If not, my thoughts remain roughly https://github.com/pandas-dev/pandas/issues/36195#issuecomment-786740079

jorisvandenbossche commented 3 years ago

One aspect I am running into while working on a proof-of-concept implementation is: can we still have "shallow" copies? (aka what will df.copy(deep=False) do?)

Currently, when making a copy of a DataFrame (DataFrame.copy()), it will make a "deep" copy by default (copy the actual data of the columns), but you have the option of deep=False to have a "shallow" copy: a new DataFrame object (with a new BlockManager object) but referencing the same arrays:

When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).

This means you can have behaviour like this:

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3]})
>>> df2 = df.copy(deep=False)

# mutating existing columns of df2 also updates original df
>>> df2.iloc[0, 0] = 100
>>> df
     a    b
0  100  0.1
1    2  0.2
2    3  0.3

# but df2 is a separate BlockManager, so adding new columns/rows doesn't update original df
>>> df2['c'] = 1
>>> df
     a    b
0  100  0.1
1    2  0.2
2    3  0.3

However, I am not sure we can reliably preserve such functionality when also having Copy-on-Write.

Assume the following case:

df = pd.DataFrame(...)
subset = df[1:3]
subset2= subset.copy(deep=False)

The subset dataframe actually views the same data as df because it was constructed using a slice (but it keeps track of that reference, so with C-o-W, mutating subset or df will trigger a copy). But now I created a "shallow" copy of subset. According to the current logic, mutating subset2 should also update subset (and vice versa), but should not update df.

Currently, I am simply keeping track of references (with weakref), so that we know that we have to copy an array on setitem if there is any other object also referencing that array. However, to support the above logic, we would also need to keep track of a second type of references for cases where it's actually a view for the user (though shallow copy), and thus both objects need to be kept synchronized. (concrete example: when mutating subset (eg subset.iloc[0, 0] = 100), C-o-W will trigger a copy of the array for that column, because it references data of df. But if we want to keep the "shallow" relationship between subset and subset2, this setitem operation on subset should also replace that array in subset2)

Adding such logic will certainly make it a lot more complicated, but I am also not sure it can even be done reliably (and also adds new dubious corner cases, e.g. what if I added a column to subset2 in the meantime, are the other columns still views on the columns of subset?)

Are there actually use cases where df.copy(deep=False) is useful? And then I mean specifically the aspect of it that mutations are propagated, because if you don't need this aspect, C-o-W already gives you the aspect of "new dataframe object without copying the data".

So maybe we can change the "shallow" copy idea to mean the "C-o-W delayed copy" concept, so that df2 = df.copy(deep=False) would be equivalent to df2 = df[:] ? (new object referencing same underlying data, but mutations are not propagated)

jbrockmendel commented 3 years ago

These are some tough issues, thanks for taking a look at them.

The subset dataframe actually views the same data as df because it was constructed using a slice (but it keeps track of that reference, so with C-o-W, mutating subset or df will trigger a copy). But now I created a "shallow" copy of subset. According to the current logic, mutating subset2 should also update subset (and vice versa), but should not update df.

The example you gave subset = df[1:3] is slicing along rows. Does anything become easier if you focus on CoW for only indexing along columns? The existing behavior for slicing on rows is simple+predictable, so if it is appreciably easier to implement CoW only for columns, that would be worth considering.

Are there actually use cases where df.copy(deep=False) is useful? And then I mean specifically the aspect of it that mutations are propagated

Nothing comes to mind.

jorisvandenbossche commented 3 years ago

Does anything become easier if you focus on CoW for only indexing along columns? The existing behavior for slicing on rows is simple+predictable, so if it is appreciably easier to implement CoW only for columns, that would be worth considering.

Based on my current understanding / implementation, it wouldn't help. Consider the following case:

df = pd.DataFrame(...)
subset = df[1:3]
subset2= subset[['col1', 'col2']]

Here, when limiting CoW for only column-indexing, subset would be an actual view on df, while subset2 is a CoW-view on subset. This means that when modifying subset2, it shouldn't modify the parent subset, but also the other way around: modifying the parent subset should not modify the child subset2. To achieve this, CoW would ensure we copy the column array in subset when modifying it. But that would break the view-relationschip between df and subset.

Basically, when implementing this in pandas in python (it's not a language-level feature), I don't think we can reliably mix real views and CoW views between pandas objects (DataFrames / Series). So I think that for CoW, it would be all or nothing.

jorisvandenbossche commented 3 years ago

I cleaned up my experimental branch with a proof of concept I did a month ago -> see https://github.com/pandas-dev/pandas/pull/41878 with a Copy-on-Write implementation