
BUG: Pandas 1.0.5 → 1.1.0 behavior change on DataFrame.apply() where func returns tuple #35518

Open · dechamps opened this issue 4 years ago

dechamps commented 4 years ago

Code Sample, a copy-pastable example

import pandas as pd

print(
  pd.DataFrame([['orig1', 'orig2']])
  .apply(func=lambda col: ('new1', 'new2')))

Output of Pandas 1.0.5

0    (new1, new2)
1    (new1, new2)
dtype: object

Output of Pandas 1.1.0

      0     1
0  new1  new1
1  new2  new2

It is not clear to me if this behaviour change is intended or not. I couldn't find anything obvious in the release notes.

Possibly related: #35517, #34909 @simonjayhawkins @jbrockmendel

This broke my code, which actively relies on tuples being treated as scalars and stored as single objects (instead of being spread across the DataFrame).

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : d9fff2792bf16178d4e450fe7384244e50635733
python           : 3.6.9.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.19.104+
Version          : #1 SMP Wed Feb 19 05:26:34 PST 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.0
numpy            : 1.19.1
pytz             : 2018.9
dateutil         : 2.8.1
pip              : 19.3.1
setuptools       : 49.2.0
Cython           : 0.29.21
pytest           : 3.6.4
hypothesis       : None
sphinx           : 1.8.5
blosc            : None
feather          : 0.4.1
xlsxwriter       : None
lxml.etree       : 4.2.6
html5lib         : 1.0.1
pymysql          : None
psycopg2         : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2           : 2.11.2
IPython          : 5.5.0
pandas_datareader: None
bs4              : 4.6.3
bottleneck       : 1.3.2
fsspec           : 0.7.4
fastparquet      : None
gcsfs            : None
matplotlib       : 3.2.2
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 2.5.9
pandas_gbq       : 0.11.0
pyarrow          : 0.14.1
pytables         : None
pyxlsb           : None
s3fs             : 0.4.2
scipy            : 1.4.1
sqlalchemy       : 1.3.18
tables           : 3.4.4
tabulate         : 0.8.7
xarray           : 0.15.1
xlrd             : 1.1.0
xlwt             : 1.3.0
numba            : 0.48.0
jbrockmendel commented 4 years ago

which is actively relying on tuples being treated as scalars and stored as single objects

If you have a viable way to avoid this in your code, I'd encourage you to use it. Regardless of how this issue is addressed, tuples-as-scalars is fragile.
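
A minimal sketch of that fragility: whether a tuple is expanded or kept whole already depends on how it is handed to pandas.

import pandas as pd

# A bare tuple is treated as list-like and expanded into elements...
print(pd.Series(("a", "b")))
# 0    a
# 1    b
# dtype: object

# ...while the same tuple wrapped in a list is stored as a single object cell.
print(pd.Series([("a", "b")]))
# 0    (a, b)
# dtype: object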

dechamps commented 4 years ago

If you have a viable way to avoid this in your code, I'd encourage you to use it. Regardless of how this issue is addressed, tuples-as-scalars is fragile.

Yep. Well at least this issue forced me to clean up my code :) I'm now wrapping the value inside a fully opaque container object.
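
The container itself isn't shown here; a minimal sketch of the idea, using a hypothetical Box class: pandas never treats an arbitrary object as list-like, so the wrapped tuple always stays in a single cell.

import pandas as pd

# Hypothetical opaque wrapper: apply() sees a plain object, not a tuple,
# so the result is always reduced to one object per column.
class Box:
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return f"Box({self.value!r})"

df = pd.DataFrame([['orig1', 'orig2']])
print(df.apply(func=lambda col: Box(('new1', 'new2'))))
# 0    Box(('new1', 'new2'))
# 1    Box(('new1', 'new2'))
# dtype: object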

simonjayhawkins commented 4 years ago

moved off 1.1.2 milestone (scheduled for this week) as no PRs to fix in the pipeline

simonjayhawkins commented 4 years ago

moved off 1.1.3 milestone (overdue) as no PRs to fix in the pipeline

simonjayhawkins commented 4 years ago

moved off 1.1.4 milestone (scheduled for release tomorrow) as no PRs to fix in the pipeline

jorisvandenbossche commented 4 years ago

According to the docstring, I would say that the behaviour of 1.0.5 was correct, and this is a regression.

@jbrockmendel would you have time to look into it?

simonjayhawkins commented 3 years ago

According to the docstring

just to be clear, in the DataFrame.apply docstring https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html, the description for the result_type parameter is...

The default behaviour (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns.

The return type of the user function in the OP is a tuple (considered list-like), so we expect a Series of those.

This issue also occurs with a list, which is unambiguously list-like, where we also expect the default result_type behaviour to be to reduce.

>>> pd.__version__
'1.3.0.dev0+100.g54682234e3'
>>>
>>> df = pd.DataFrame([["orig1", "orig2"]])
>>>
>>> df.apply(func=lambda col: ("new1", "new2"), result_type="reduce")
0    (new1, new2)
1    (new1, new2)
dtype: object
>>>
>>> df.apply(func=lambda col: ("new1", "new2"))
      0     1
0  new1  new1
1  new2  new2
>>>
>>> df.apply(func=lambda col: ["new1", "new2"], result_type="reduce")
0    [new1, new2]
1    [new1, new2]
dtype: object
>>>
>>> df.apply(func=lambda col: ["new1", "new2"])
      0     1
0  new1  new1
1  new2  new2
>>>

I would say that the behaviour of 1.0.5 was correct, and this is a regression.

agreed.

@jbrockmendel would you have time to look into it?

ping

simonjayhawkins commented 3 years ago

Possibly related: #35517, #34909 @simonjayhawkins @jbrockmendel

can confirm, first bad commit: [91802a9ae400830f9eaadd395f6a9b40cdd92ee5] PERF: avoid creating many Series in apply_standard (#34909)

jbrockmendel commented 3 years ago

Aside from reverting #34909, the solution that comes to mind is calling the function on the first row in wrap_results_for_axis and seeing if we get a tuple. That runs into other problems with non-univalent or mutating functions.
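
As a standalone sketch of that probing idea (not pandas internals; apply_with_probe is a hypothetical name), the extra call to func is built in, which is exactly what goes wrong for mutating or non-deterministic functions:

import pandas as pd

def apply_with_probe(df, func):
    # The extra call jbrockmendel warns about: func runs once more on the
    # first column just to inspect the type of its result.
    probe = func(df.iloc[:, 0])
    results = {label: func(col) for label, col in df.items()}
    if isinstance(probe, tuple):
        # Tuples-as-scalars: one object per column, the 1.0.5 behaviour.
        return pd.Series(results)
    # Other list-like results expand to columns, the 1.1.0 behaviour.
    return pd.DataFrame(results)

df = pd.DataFrame([["orig1", "orig2"]])
print(apply_with_probe(df, lambda col: ("new1", "new2")))
# 0    (new1, new2)
# 1    (new1, new2)
# dtype: object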

simonjayhawkins commented 3 years ago

removing milestone