pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API: with CoW, should every new Series/DataFrame object also have its own new Index objects? #53721

Open jorisvandenbossche opened 1 year ago

jorisvandenbossche commented 1 year ago

Related to https://github.com/pandas-dev/pandas/issues/53529, about index labels still being shared after an indexing operation.

Currently, a shallow copy of a DataFrame/Series only creates a new DataFrame/Series object, while pointing to the same data and index objects under the hood. Quoting the docstring:

When deep=False, a new object will be created without copying the calling object's data or index (only references to the data and index are copied).

and one of the examples in the docstring:

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)

# Shallow copy shares data and index with original.
>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True

So apart from being the long-standing behaviour, it's also clearly documented that the Index objects (for the row index and the columns) are identical in the returned shallow copy.

However, under Copy-on-Write (CoW), we have updated the behaviour of a shallow copy so that it no longer propagates mutations to the values of the Series/DataFrame. It still returns a new Series/DataFrame that shares memory with the original, but that memory is protected by reference tracking to ensure we copy later on if needed. So the question is: for shallow copies like this, when CoW is enabled, should we also protect against sharing mutable state in the index/columns attributes of the Series/DataFrame?
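To make the values side of this concrete, here is a minimal sketch. It assumes pandas 2.x, where CoW is opt-in through the `mode.copy_on_write` option; on versions where CoW is always on, that option may be deprecated or missing, which is why setting it is wrapped in a `try`.

```python
import pandas as pd

# Assumption: pandas 2.x, where CoW is enabled through this option.
# On versions where CoW is always on, the option may be deprecated or absent.
try:
    pd.set_option("mode.copy_on_write", True)
except Exception:
    pass

s = pd.Series([1, 2], index=["a", "b"])
shallow = s.copy(deep=False)

# Writing through the shallow copy triggers a copy of the shared values first,
# so the parent keeps its original data.
shallow.iloc[0] = 100

parent_first = int(s.iloc[0])
shallow_first = int(shallow.iloc[0])
print(parent_first, shallow_first)
```

With CoW active, the parent should still hold its original first value, while only the shallow copy sees the write.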

Current behaviour:

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> shallow = s.copy(deep=False)
>>> shallow.index.name = "some_new_name"
>>> s
some_new_name      # <-- modifying shallow copy also modified the parent series
a    1
b    2
dtype: int64

My proposal would be to extend the notion of "changing one object never updates another object" under CoW, which we currently apply to the data values, so that it also applies to the mutable state of the Index.

This can easily be achieved by also shallow copying the Index objects when shallow copying a DataFrame/Series.
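As a rough user-level illustration of the idea (not how pandas would implement it internally), the sketch below uses a hypothetical helper, `shallow_copy_with_new_index`, that makes a shallow copy and then gives it its own Index object via `Index.view()`. The helper name is made up for this example.

```python
import pandas as pd


def shallow_copy_with_new_index(obj: pd.Series) -> pd.Series:
    """Hypothetical helper: shallow copy a Series and give it its own Index object."""
    new = obj.copy(deep=False)
    # view() returns a new Index object that still shares the underlying labels.
    new.index = obj.index.view()
    return new


s = pd.Series([1, 2], index=["a", "b"])
new = shallow_copy_with_new_index(s)

# Renaming the index of the copy no longer leaks into the parent,
# because the two Series hold distinct Index objects.
new.index.name = "some_new_name"
print(s.index.name, new.index.name)
```

Here the parent's index name stays unset, because the name lives on the Index object and the two Series no longer share that object.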

How to shallow copy an Index? The Index class also has a copy method with a deep argument, so one can do idx.copy(deep=False) to get a new Index object sharing the same data (deep=False is actually the default for Index). There is also an idx.view() method, which does essentially the same but additionally sets the _id attribute of the new Index, ensuring that equality testing still passes the fast-path identity check (idx1.is_(idx2) is True). We can probably use a view instead of an actual shallow copy with copy(deep=False), which would keep the fast path in the several places where we check for Index equality between objects.
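The difference between the two options can be checked with public methods only. This is a small sketch based on the behaviour described above; the variable names are just for illustration.

```python
import pandas as pd

idx = pd.Index(["a", "b"], name="x")

copied = idx.copy(deep=False)  # new Index object with its own identity
viewed = idx.view()            # new Index object that keeps the original's identity

# Both are distinct Python objects with the same labels.
both_new = (copied is not idx) and (viewed is not idx)
both_equal = bool(copied.equals(idx)) and bool(viewed.equals(idx))

# Only the view passes the identity fast path.
view_is_same = viewed.is_(idx)
copy_is_same = copied.is_(idx)
print(both_new, both_equal, view_is_same, copy_is_same)
```

Both results compare equal to the original, but only the view is expected to report `is_(idx)` as true, which is what keeps the fast path available.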

Consequence beyond the copy method? It seems we actually have quite a few places that currently return identical Index objects shared by multiple objects, beyond just the typical shallow copy case. Those could / should all change with the proposal above. Some examples:

phofl commented 1 year ago

I think this is a consequence of the CoW rules, so we probably have to do this. The last point should also make a shallow copy imo.