Open Code0x58 opened 2 months ago
Thanks for the report. For efficiency, pandas implements skipna=True by filling NA values with 0; recreating a new, smaller array would be prohibitively inefficient. If you'd like your objects to be operable with pandas, I'd suggest implementing integer addition (at the very least, with the number 0).
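To make the suggestion concrete, here is a minimal, self-contained sketch (the `Meters` class is hypothetical, purely for illustration) of what "implementing integer addition with 0" looks like, so that the fill-NA-with-0 strategy composes with a custom object:

```python
class Meters:
    """Toy value type (hypothetical) that supports addition with the int 0,
    so a 'fill NA with 0, then reduce' strategy works on object dtype."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        if other == 0:                      # the identity NA slots are filled with
            return Meters(self.value)
        if isinstance(other, Meters):
            return Meters(self.value + other.value)
        return NotImplemented

    __radd__ = __add__                      # 0 + Meters(...) must also work

# Simulate skipna=True over an object column: NA slots become 0, then a plain sum.
data = [Meters(1), None, Meters(2)]
filled = [0 if v is None else v for v in data]
total = sum(filled)                         # sum() starts from 0, exercising __radd__
print(total.value)                          # 3
```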
That makes a fair bit of sense, and works out of the box for things like the built-in `complex`, `Fraction`, and `Decimal` types, which are probably more representative of normal use, as the relevant identity (i.e. 0 or 1) is used.
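A quick check of the claim above: these stdlib types already accept addition with the int 0, which is exactly what the NA-fill strategy relies on (the reduction below simulates the fill, it is not pandas itself):

```python
from decimal import Decimal
from fractions import Fraction

# Each type treats the int 0 as a proper additive identity.
for value in (Fraction(3, 4), Decimal("1.5"), complex(2, 3)):
    assert 0 + value == value        # __radd__ with the identity
    assert value + 0 == value        # __add__ with the identity

# Simulated skipna=True sum over an object column with a missing slot:
data = [Fraction(1, 2), None, Fraction(1, 3)]
total = sum(0 if v is None else v for v in data)
print(total)                         # 5/6
```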
In the case of operations that produce scalars, like `.sum()`, `.min()`, etc., I'd have thought that, given the cost of operating with Python objects, omitting invocations on nulls would actually end up a little more efficient than dropping in 1 or 0 and throwing that at the next encountered value. I suppose, given the behaviour on empty series for `.min()`, `.prod()`, and `.sum()`, you'd want special handling everywhere for `NaN`, 1, and 0 if everything else is unchanged, which seems fair. As a bit of an extension, if you could provide the identity value and it supported `__iadd__` or whatever else, you could likely end up with somewhat better efficiency, as you can reasonably claim ownership over a provided `identity` parameter (or magic method?) and avoid creating intermediate objects.
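The "claim ownership over a provided identity" idea above can be sketched like this (the `RunningTotal` class and `reduce_skipna` helper are hypothetical, not pandas API): because the reduction owns the accumulator, `__iadd__` can mutate it in place and no intermediate result objects are allocated.

```python
class RunningTotal:
    """Hypothetical mutable accumulator: in-place addition mutates it,
    so a reduction that owns it allocates no intermediate results."""

    def __init__(self, value=0):
        self.value = value

    def __iadd__(self, other):
        self.value += other          # mutate in place instead of allocating
        return self

def reduce_skipna(values, identity):
    # The reduction claims ownership of the caller-provided identity:
    # null slots are simply skipped rather than filled with 0.
    acc = identity
    for v in values:
        if v is None:                # stand-in for NaN/NA detection
            continue
        acc += v
    return acc

result = reduce_skipna([1, None, 2, None, 3], RunningTotal())
print(result.value)                  # 6
```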
I can see what you mean about the efficiency of the expanding operations like `.cumsum()`, `.cummin()`, etc., as you'd need to either copy the object or, more precisely, recreate an identity to accumulate into in order to produce a new value. The puritan in me likes the idea of this over "you just get some default int", maybe with an interface like `series.cumsum(identity_factory=X)`.
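The proposed `identity_factory` interface might look something like the following sketch (a free function standing in for the hypothetical method; none of this is real pandas API). Each call to the factory produces a fresh identity the operation owns, so output slots never share state:

```python
from fractions import Fraction

def cumsum_skipna(values, identity_factory):
    """Hypothetical expanding sum: the caller supplies a factory for the
    additive identity, and null slots are carried over as None."""
    out = []
    running = identity_factory()     # fresh object the operation owns
    for v in values:
        if v is None:                # skipna: carry the running value forward
            out.append(None)
            continue
        running = running + v        # each step yields a new object for the output
        out.append(running)
    return out

print(cumsum_skipna([Fraction(1, 2), None, Fraction(1, 3)],
                    identity_factory=lambda: Fraction(0)))
# [Fraction(1, 2), None, Fraction(5, 6)]
```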
As much as I think there could be some nice performance gains, and maybe a little less surprise with the documentation as it stands, this particularly niche case can be solved by handling `NaN`, 0, and 1 as you say. The "surprise" for the very odd user could be ironed out with documentation explaining what happens with the object dtype, along with examples of handling those special cases.
All that said, as interesting as I've found it to talk about, I don't have any real interest in promoting or working on any sort of "scalars might go faster, and explicit identities for in-place ops may be faster still" or "explicit identities allow another way to do expanding" change. Even updating the documentation feels like a bit of a stretch now, as I guess this doesn't come up much; even my case only came up from trying to hack something into an existing project, so I have plenty of flexibility.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Aggregation methods over object dtype do not respect skipna. The exception produced is shown above.
Expected Behavior
NaN values are skipped, given that all-NaN values are documented to produce NaN.
Installed Versions