pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG/inconsistency: pd.Series(dtype=object).sum(skipna=True) does not respect skipna=True #59764

Open Code0x58 opened 2 months ago

Code0x58 commented 2 months ago

Pandas version checks

Reproducible Example

import pandas as pd
import numpy as np

# A toy value type that only knows how to combine with other X instances
# (it has no handling for plain ints or floats).
class X:
    def __init__(self, n, generation=0):
        self.n = n
        self.generation = generation

    def __add__(self, other):
        return X(self.n + other.n, max(self.generation, other.generation) + 1)

    def __mul__(self, other):
        return X(self.n * other.n, max(self.generation, other.generation) + 1)

    def __repr__(self):
        return f'X({self.n}, {self.generation})'

s = pd.Series([X(1), X(2), np.nan, X(0)])

# aggregation methods have skipna=True by default anyway
s.sum(skipna=True)

Issue Description

Aggregation methods over object-dtype Series do not respect skipna; the example above raises the exception below.

# sum, cumsum, prod, cumprod, ...
AttributeError: 'int' object has no attribute 'n'

Expected Behavior

NaN values are skipped.

Given that all-NaN input is documented to produce NaN, skipping NaN values for object dtype would be consistent with that.

Installed Versions

```
INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-213-generic
Version : #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : C.UTF-8
pandas : 2.2.2
numpy : 2.1.1
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : None
pip : 24.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
```
rhshadrach commented 2 months ago

Thanks for the report. For efficiency, pandas implements skipna=True by filling NA values with 0. I think recreating a new, smaller array would be prohibitively inefficient. If you'd like your objects to be operable with pandas, I'd suggest implementing addition with integers (at the very least, with the number 0).
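
A minimal sketch of that suggestion (illustrative, not from the thread): teach X to treat a plain numeric operand as another X, so the 0 (or 1) that pandas fills in for NA no longer breaks the reduction.

```python
# Sketch: make X tolerant of the int fill values pandas substitutes for NA
# when skipna=True (0 for sum-like ops, 1 for prod-like ops).
class X:
    def __init__(self, n, generation=0):
        self.n = n
        self.generation = generation

    @staticmethod
    def _coerce(other):
        # Wrap plain numbers (e.g. the NA fill value 0) in an X so the
        # arithmetic below stays uniform.
        return X(other) if isinstance(other, (int, float)) else other

    def __add__(self, other):
        other = self._coerce(other)
        return X(self.n + other.n, max(self.generation, other.generation) + 1)

    __radd__ = __add__  # so 0 + X(...) works when the fill value comes first

    def __mul__(self, other):
        other = self._coerce(other)
        return X(self.n * other.n, max(self.generation, other.generation) + 1)

    __rmul__ = __mul__

    def __repr__(self):
        return f'X({self.n}, {self.generation})'
```

With that in place, `pd.Series([X(1), X(2), np.nan, X(0)]).sum()` should reduce without raising, at the cost of the fill value bumping `generation` once.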

Code0x58 commented 2 months ago

That makes a fair bit of sense, and it works out of the box for types like the built-in complex, Fraction, and Decimal, which are probably more representative of normal use, since the relevant identity (i.e. 0 or 1) can be used directly with them.
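
For example, standard numeric types that accept addition with a plain 0 sum cleanly over object dtype (illustrative sketch; outputs are expectations, not quoted from the thread):

```python
# Illustrative: types that define addition with the int 0 work with the
# fill-with-0 strategy, so skipna behaves as expected for them.
from decimal import Decimal
from fractions import Fraction

import numpy as np
import pandas as pd

d = pd.Series([Decimal("1.5"), np.nan, Decimal("2.5")], dtype=object)
print(d.sum())   # expected output: 4.0 (a Decimal)

f = pd.Series([Fraction(1, 3), np.nan, Fraction(2, 3)], dtype=object)
print(f.sum())   # expected output: 1 (a Fraction)
```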

In the case of operations that produce scalars, like .sum(), .min(), etc., I'd have thought that, given the cost of operating on Python objects, omitting the invocations involving nulls would actually end up a little more efficient than dropping in a 1 or 0 and feeding it to the next encountered value. I suppose, given the behaviour on empty series for .min(), .prod(), and .sum(), you'd want special handling for NaN, 1, and 0 everywhere if everything else is to stay unchanged, which seems fair. As a bit of an extension, if you could provide the identity value and it supported __iadd__ or whatever else, you could likely end up with somewhat better efficiency, since you can reasonably claim ownership over a provided identity parameter (or magic method?) and avoid creating intermediate objects.
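
A rough user-space sketch of the "skip the null invocations entirely" idea (illustrative only; not how pandas is implemented), reusing the X class from the reproducible example:

```python
import functools
import operator

import numpy as np
import pandas as pd

s = pd.Series([X(1), X(2), np.nan, X(0)])

# Reduce over only the non-null values, so no 0/1 identity object ever
# enters the chain of __add__ calls.
non_null = [v for v in s if not pd.isna(v)]
total = functools.reduce(operator.add, non_null) if non_null else np.nan
print(total)  # X(3, 2)
```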

I can see what you mean about the efficiency of the expanding operations like .cumsum(), .cummin(), etc., as you'd need to either copy the object or, more precisely, recreate an identity to accumulate into in order to produce each new value. The purist in me likes the idea of this over "you just get some default int", maybe with an interface like series.cumsum(identity_factory=X).
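
A user-space sketch of what that identity_factory idea could look like (hypothetical helper; series.cumsum() has no such parameter today):

```python
import numpy as np
import pandas as pd

def cumsum_with_identity(series, identity_factory):
    """Cumulative sum over object dtype that skips NaN (hypothetical helper)."""
    total = identity_factory()          # fresh identity to accumulate into
    out = []
    for value in series:
        if pd.isna(value):
            out.append(np.nan)          # NaN positions stay NaN in the result
        else:
            total = total + value       # could use += if the type supports __iadd__
            out.append(total)
    return pd.Series(out, index=series.index, dtype=object)

# e.g. cumsum_with_identity(s, lambda: X(0))
```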

As much as I think there could be some nice performance gains, and maybe a little less surprise given the documentation as it stands, this particular niche case can be solved by handling NaN, 0, and 1 as you say. The "surprise" for the very occasional user could be ironed out with documentation explaining what happens with the object dtype, along with examples of handling those special cases.

That all said, as interesting as I've found it to talk about, I don't have any real interest in promoting or working on any sort of "scalars might go faster, and explicit identities for in-place ops may be faster still" or "explicit identities allow another way to do expanding" change. Even updating the documentation feels like a bit of a stretch now, as I guess this doesn't come up much; even my case only came up from trying to hack something into an existing project, so I have plenty of flexibility.