Currently, a shallow copy of a DataFrame/Series only creates a new DataFrame/Series object, while pointing to the same data and index objects under the hood. Quoting the docstring:
When deep=False, a new object will be created without copying
the calling object's data or index (only references to the data
and index are copied).
and one of the examples in the docstring:
>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)
# Shallow copy shares data and index with original.
>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True
So apart from being the long-standing behaviour, it is also clearly documented that the Index objects (for the row index and the columns) are identical in the returned shallow copy.
However, under Copy-on-Write (CoW), we updated the behaviour of a shallow copy: mutations no longer propagate to the values of the original Series/DataFrame. A shallow copy still returns a new Series/DataFrame that shares memory with the parent, but that memory is protected by reference tracking, which ensures we copy later on if needed.
So the question is, for shallow copies like this, when CoW is enabled, should we also protect from sharing mutable state in the index/columns attributes of the Series/DataFrame?
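To make the CoW behaviour for the values concrete, here is a minimal sketch. It assumes pandas >= 2.0; the option call is wrapped defensively because, as far as I know, on versions where CoW is the default it has no effect and may warn.

```python
import warnings

import pandas as pd

# Assumption: pandas >= 2.0. On versions where Copy-on-Write is already the
# default, setting the option is unnecessary, so warnings and errors from
# the call are ignored here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        pd.set_option("mode.copy_on_write", True)
    except Exception:
        pass

s = pd.Series([1, 2], index=["a", "b"])
shallow = s.copy(deep=False)

# Writing through the shallow copy triggers a copy of the values first,
# so the parent keeps its original data.
shallow.iloc[0] = 100

parent_values = s.tolist()
child_values = shallow.tolist()
print(parent_values, child_values)
```

The values are protected, but nothing in this snippet says anything about the index, which is the gap the question above is about.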
Current behaviour:
>>> s = pd.Series([1, 2], index=["a", "b"])
>>> shallow = s.copy(deep=False)
>>> shallow.index.name = "some_new_name"
>>> s
some_new_name  # <-- modifying the shallow copy also modified the parent Series
a    1
b    2
dtype: int64
My proposal is to extend the notion of "changing one object never updates another object", which under CoW we currently apply to the data values, so that it also covers the mutable state of the Index.
This can easily be achieved by also shallow copying the Index objects when shallow copying a DataFrame/Series.
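As a rough user-level sketch of what this would look like (not how pandas would implement it internally): the helper name `shallow_copy_with_index_view` is hypothetical, and it uses `Index.view()` to get a new Index object over the same data.

```python
import pandas as pd


def shallow_copy_with_index_view(obj):
    """Hypothetical helper: shallow copy plus new Index objects.

    The values are still shared with ``obj``; only the Index *objects*
    (and therefore their mutable state, such as ``name``) are new.
    """
    result = obj.copy(deep=False)
    result.index = obj.index.view()
    if isinstance(result, pd.DataFrame):
        result.columns = obj.columns.view()
    return result


s = pd.Series([1, 2], index=["a", "b"])
shallow = shallow_copy_with_index_view(s)

# Renaming the index of the copy no longer touches the parent's index.
shallow.index.name = "some_new_name"
parent_name = s.index.name
print(parent_name, shallow.index.name)
```

With this, `s.index.name` stays `None` while the copy carries the new name.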
How to shallow copy an Index?
The Index class also has a copy method with a deep argument, so one can do idx.copy(deep=False) to get a new Index object sharing the same data (this is actually the default for Index). There is also an idx.view() method, which does essentially the same but additionally sets the _id attribute of the new Index to that of the original, so that equality testing still passes the fast-path identity check (idx1.is_(idx2) is True).
We can probably use a view instead of an actual shallow copy with copy(deep=False), which keeps the faster path in the several places where we check for Index equality between objects.
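The difference between the two options can be checked directly. This is a small sketch using an integer Index so that `.values` is a plain NumPy array; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.Index([10, 20, 30])

shallow = idx.copy(deep=False)  # new Index object, same data
view = idx.view()               # new Index object, same data, same identity token

# Both are distinct Python objects from the original ...
new_objects = (shallow is not idx) and (view is not idx)

# ... and both share the underlying buffer with it.
shares_data = bool(
    np.shares_memory(idx.values, shallow.values)
    and np.shares_memory(idx.values, view.values)
)

# Only the view still passes the identity fast path.
view_is_identical = view.is_(idx)
copy_is_identical = shallow.is_(idx)
print(new_objects, shares_data, view_is_identical, copy_is_identical)
```

So a view gives the same isolation of mutable state as `copy(deep=False)`, while `is_` still recognises it as the same index.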
Consequence beyond the copy method?
It seems we have quite a few places that currently return identical index objects shared by multiple objects, beyond the typical shallow copy case. Those could / should all change with the proposal above. Some examples:
Any method that currently returns a new Series/DataFrame and preserves the shape. Those that return a shallow copy under CoW will automatically be updated by changing the behaviour of copy(deep=False), such as:
>>> s = pd.Series([1, 2])
>>> s2 = s.infer_objects()
>>> s2.index is s.index # current behaviour, would become False
True
But any method that returns new data (and currently doesn't need to care about CoW) while preserving the shape typically also reuses the index object, and should probably be updated as well?
>>> s = pd.Series([1, 2])
>>> s2 = s.diff()
>>> s2.index is s.index # current behaviour, would become False
True
Binary / unary operations preserve the index in the result:
>>> s = pd.Series([1, 2])
>>> s2 = s + 1
>>> s2.index is s.index # current behaviour, would become False
True
Reindexing with an Index object would also end up not with exactly that object but with a shallow copy of it:
>>> s = pd.Series([1, 2])
>>> idx = pd.Index([0, 1, 2])
>>> s2 = s.reindex(idx)
>>> s2.index is idx # current behaviour, would become False
True
Maybe more controversial, or more surprising for users if we did this: what about constructors where you pass an Index object?
>>> idx = pd.Index([0, 1, 2])
>>> s = pd.Series(["a", "b", "c"], index=idx)
>>> s.index is idx # should this be False as well?
True
Here it might be more logical to use exactly that object (and not create a shallow copy of it)? But that would mean changes propagate if you passed the index object of an existing Series/DataFrame.
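Whatever the constructor ends up doing, a caller who wants to avoid the propagation today can pass a view explicitly. A minimal sketch of that workaround (not a proposal for the constructor itself):

```python
import pandas as pd

idx = pd.Index([0, 1, 2])

# Pass a view rather than the Index object itself, so the new Series
# does not hold a reference to the caller's Index object.
s = pd.Series(["a", "b", "c"], index=idx.view())

# Mutating the Series' index state leaves the original Index untouched.
s.index.name = "renamed"
original_name = idx.name
print(original_name, s.index.name)
```

The labels still compare equal (`s.index.equals(idx)`), only the mutable state is no longer shared.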
Related to https://github.com/pandas-dev/pandas/issues/53529, about index labels still being shared after an indexing operation.