pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Inconsistency between behavior of cumsum when applied directly, and when applied within groupby. #44009

Open floccinauc opened 3 years ago

floccinauc commented 3 years ago

Reproducible Example

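import pandas as pd

df = pd.DataFrame({"A": [2, 1, 2, 2],
                   "B": [3, 3, 4, 4],
                   "E": [['10'], ['20'], ['30'], ['40']]})

# Applied directly (pandas 1.3.3): both calls succeed; the lists in "E" are concatenated cumulatively.
df["A"].cumsum()
df["E"].cumsum()

# Applied within groupby: the numeric column works, but the list column raises NotImplementedError.
df.groupby("B")["A"].cumsum()
df.groupby("B")["E"].cumsum()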

Issue Description

cumsum() behaves inconsistently depending on whether it is applied directly or within groupby(). Applied directly to a DataFrame column that contains lists, cumsum() progressively concatenates the lists. Applied through groupby() on the same DataFrame, cumsum() raises "NotImplementedError: function is not implemented for this dtype: [how->cumsum,dtype->object]".

Expected Behavior

I'd expect cumsum() to progressively concatenate the lists of all rows within each group defined by groupby(), in a way similar to what it does with numeric values.
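
To make the expected result concrete, here is a minimal sketch that produces it by hand. It is only an illustration, not pandas' implementation or a proposed fix: the helper name cumulative_concat is hypothetical, and it relies on itertools.accumulate with list addition rather than on cumsum() itself.

```python
from itertools import accumulate
import operator

import pandas as pd

df = pd.DataFrame({"A": [2, 1, 2, 2],
                   "B": [3, 3, 4, 4],
                   "E": [["10"], ["20"], ["30"], ["40"]]})


def cumulative_concat(group: pd.Series) -> pd.Series:
    """Hypothetical helper: running list concatenation within one group."""
    # accumulate(..., operator.add) applies list + list, so each step
    # yields the concatenation of all lists seen so far in the group.
    return pd.Series(list(accumulate(group, operator.add)), index=group.index)


# Iterate over the groups explicitly, then restore the original row order.
parts = [cumulative_concat(group) for _, group in df.groupby("B")["E"]]
expected = pd.concat(parts).sort_index()

# Index 0 and 1 belong to group B == 3, index 2 and 3 to group B == 4.
print(expected.tolist())
```

For this frame the rows in group B == 3 become ['10'] and ['10', '20'], and the rows in group B == 4 become ['30'] and ['30', '40'], mirroring how the numeric column "A" is summed within each group.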

Installed Versions

INSTALLED VERSIONS
------------------
commit : 73c68257545b5f8530b7044f56647bd2db92e2ba
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-147-generic
Version : #151-Ubuntu SMP Fri Jun 18 19:21:19 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.3
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.1.3
Cython : 0.29.21
pytest : 6.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pyxlsb : None
s3fs : None
scipy : 1.5.0
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.53.1

mzeitlin11 commented 3 years ago

Thanks for the report, @floccinauc! Related: #29033