pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.61k stars 17.9k forks source link

BUG: ChainedAssignmentError for CoW not working for chained inplace methods when passing `*args` or `**kwargs` #56456

Open jorisvandenbossche opened 10 months ago

jorisvandenbossche commented 10 months ago

While working on https://github.com/pandas-dev/pandas/pull/56402 (ensuring full test coverage for the warning we raise when you do chained inplace methods), we ran into another corner case where the refcount differs, and mixes up the refcount-based inference about whether we are being called in a "chained" context and thus have to raise the ChainedAssignment warning. (a previous case we discovered was when the setitem call is coming from cython code: https://github.com/pandas-dev/pandas/issues/51315)

Specifically, when passing *args or **kwargs to the inplace method call, it takes a different execution path in the Python interpreter compared to calling it with manually specified arguments. And starting from Python 3.11, this other execution path gives one reference less.

Example (with CoW, the df will never be updated with such chained method call):

>>> pd.options.mode.copy_on_write = True
>>> df = DataFrame({"a": [1, 2, np.nan], "b": 1})
>>> df["a"].fillna(0, inplace=True)
ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
When using the Copy-on-Write mode, such inplace method never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' instead, to perform the operation inplace on the original object.

That warning works for all tested Python versions. However, when passing the args or kwargs, the warning doesn't work on Python >= 3.11:

>>> pd.options.mode.copy_on_write = True
>>> df = DataFrame({"a": [1, 2, np.nan], "b": 1})
>>> args = (0, )
>>> df["a"].fillna(*args, inplace=True)
# no warning

For now we decided to punt on this specific corner case, but creating this issue to we keep track of what we learned and have something to reference to when this might come up later.

jorisvandenbossche commented 10 months ago

Some more details on the inner Python interpreter details about why this is happening: when calling the function with *args or **kwargs, i.e. with a variable number of arguments, the opcode CALL_FUNCTION_EX is being used.
When calling the function in a standard way with specified keywords, prior to Python 3.11 the opcode CALL_FUNCTION_KW was being used. But with Python 3.11, this has been optimized, and this opcode is replaced with KW_NAMES + CALL opcodes. And apparently this optimized implementation results in less references to the object on which the method is called.

Example disassembly (opcodes) for Python 3.10 vs 3.11 for the above fillna call Python 3.10 (both cases have ref count of 4): ``` In [7]: dis.dis(lambda: df['a'].fillna(1, inplace=True)) 1 0 LOAD_GLOBAL 0 (df) 2 LOAD_CONST 1 ('a') 4 BINARY_SUBSCR 6 LOAD_ATTR 1 (fillna) 8 LOAD_CONST 2 (1) 10 LOAD_CONST 3 (True) 12 LOAD_CONST 4 (('inplace',)) 14 CALL_FUNCTION_KW 2 16 RETURN_VALUE In [8]: dis.dis(lambda: df['a'].fillna(*args, inplace=True)) 1 0 LOAD_GLOBAL 0 (df) 2 LOAD_CONST 1 ('a') 4 BINARY_SUBSCR 6 LOAD_ATTR 1 (fillna) 8 LOAD_GLOBAL 2 (args) 10 LOAD_CONST 2 ('inplace') 12 LOAD_CONST 3 (True) 14 BUILD_MAP 1 16 CALL_FUNCTION_EX 1 18 RETURN_VALUE ``` Python 3.11: ``` In [8]: dis.dis(lambda: df['a'].fillna(*args, inplace=True)) 1 0 RESUME 0 2 LOAD_GLOBAL 1 (NULL + df) 14 LOAD_CONST 1 ('a') 16 BINARY_SUBSCR 26 LOAD_ATTR 1 (fillna) 36 LOAD_GLOBAL 4 (args) 48 LOAD_CONST 2 ('inplace') 50 LOAD_CONST 3 (True) 52 BUILD_MAP 1 54 CALL_FUNCTION_EX 1 56 RETURN_VALUE In [9]: dis.dis(lambda: df['a'].fillna(1, inplace=True)) 1 0 RESUME 0 2 LOAD_GLOBAL 0 (df) 14 LOAD_CONST 1 ('a') 16 BINARY_SUBSCR 26 LOAD_METHOD 1 (fillna) 48 LOAD_CONST 2 (1) 50 LOAD_CONST 3 (True) 52 KW_NAMES 4 54 PRECALL 2 58 CALL 2 68 RETURN_VALUE ```

We actually noticed this different ref count for the standard case, and therefore use a different threshold for Python 3.11+:

https://github.com/pandas-dev/pandas/blob/400ae748367cf9a356b8433c1aa59f34033f02a4/pandas/compat/_constants.py#L21

But that means that for the args/kwargs use case, we incorrectly don't trigger a warning.

If we want, we probably would be able to inspect the stack to try to figure out whether the method was called with args/kwargs or not, if we are OK with paying the code complexity for this and the runtime overhead. Short-term we probably won't include this, but if it turns out to be a more common case, we can always try that in the future. I experimented a little bit with this, so posting here the code for posterity.

```python def _is_using_args_kwargs(code_object, method): # simple case of df.method(..) without args/kwargs for instr in dis.get_instructions(code_object): if instr.opname == "LOAD_METHOD" and instr.argval == method: print("we have LOAD_METHOD (fillna)") return False # fallback for more complex cases instructions = list(dis.get_instructions(code_object)) instr_call_ex = [i for i, instr in enumerate(instructions) if instr.opname == "CALL_FUNCTION_EX"] if not instr_call_ex: print("we have no CALL_FUNCTION_EX") return False method_with_ex = False for case in instr_call_ex: for i in range(case, 0, -1): if instructions[i].opname == "LOAD_ATTR": method_with_ex = instructions[i].argval == method if method_with_ex: print("we have CALL_FUNCTION_EX with prepending LOAD_ATTR (fillna)") break if method_with_ex: return True return method_with_ex ``` and this works as ``` In [6]: import dis In [7]: d = dis.Bytecode("df['a'].fillna(*args, inplace=True)") In [10]: _is_using_args_kwargs(d.codeobj, method="fillna") Out[10]: True In [11]: d = dis.Bytecode("df['a'].fillna(0, inplace=True)") In [12]: _is_using_args_kwargs(d.codeobj, method="fillna") Out[12]: False In [13]: d = dis.Bytecode("df['a'].fillna(0, inplace=True).reset_index(*args)") In [14]: _is_using_args_kwargs(d.codeobj, method="fillna") Out[14]: False ``` For the actual method, it could be integrated with something like: ```diff --- a/pandas/core/generic.py +++ b/pandas/core/generic.py @@ -6,6 +6,7 @@ from copy import deepcopy import datetime as dt from functools import partial import gc +import inspect from json import loads import operator import pickle @@ -7304,11 +7306,20 @@ class NDFrame(PandasObject, indexing.IndexingMixin): and self._is_view_after_cow_rules() ): ctr = sys.getrefcount(self) ref_count = REF_COUNT if isinstance(self, ABCSeries) and _check_cacher(self): # see https://github.com/pandas-dev/pandas/pull/56060#discussion_r1399245221 ref_count += 1 + if PY311: + calling_frame = inspect.currentframe().f_back + if _is_using_args_kwargs(calling_frame.f_code): + ref_count += 1 if ctr <= ref_count: warnings.warn( _chained_assignment_warning_method_msg, FutureWarning, ```