Open CangyuanLi opened 5 months ago
I believe Pandas' approach is more appropriate. If a row is entirely None, it should return None.
If you wish to achieve the same result as the + operator, you can set fill_value=None.
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 8, None],
        "b": [4, None, None],
    }
)
df["add"] = df["a"].add(df["b"], fill_value=0)
print(df)
"""
     a    b  add
0  1.0  4.0  5.0
1  8.0  NaN  8.0
2  NaN  NaN  NaN
"""
df["add"] = df["a"].add(df["b"], fill_value=None)
print(df)
"""
     a    b  add
0  1.0  4.0  5.0
1  8.0  NaN  NaN
2  NaN  NaN  NaN
"""
In the case of sum as an aggregation function, the idea is that the sum of no elements is 0. In particular, if all your elements are null, the sum of the non-null elements is 0. For example, pl.Series([], dtype=pl.Int32).sum() is 0 and pl.Series([None], dtype=pl.Int32).sum() is 0. The question is whether sum_horizontal should have the same behavior as the vertical sum.
EDIT: Of course aggregation functions like sum should always have the argument ignore_nulls. As soon as I saw cmdlineuser's comment, this became clear.
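To make the "sum of no elements is 0" convention concrete, here is a minimal pure-Python sketch. It is not the Polars implementation: the helper name `null_aware_sum` and its `ignore_nulls` flag are hypothetical and only model the two behaviors under discussion, with `None` standing in for null.

```python
from typing import Optional, Sequence


def null_aware_sum(
    values: Sequence[Optional[int]], ignore_nulls: bool = True
) -> Optional[int]:
    """Hypothetical model of a column sum; None stands in for null."""
    if not ignore_nulls and any(v is None for v in values):
        # Propagating mode: a single null makes the whole result null.
        return None
    # Ignoring mode: sum only the non-null values; an empty sum is 0.
    return sum(v for v in values if v is not None)


print(null_aware_sum([]))                          # 0
print(null_aware_sum([None]))                      # 0
print(null_aware_sum([None], ignore_nulls=False))  # None
```

Under this model the all-null case returns 0 only because the nulls are dropped first, which is why an explicit flag settles the disagreement.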
The frame method has ignore_nulls:
>>> df.sum_horizontal(ignore_nulls=False)
shape: (3,)
Series: 'sum' [i64]
[
5
null
null
]
But pl.sum_horizontal doesn't allow you to specify that.
Pandas add operates differently, as it aligns columns first, similar to polars' df.join(how="align"). See #9804 for a little more detail (also requested in #10390). I had a PR a long time ago (#9805) that was ultimately rejected in favor of performing the join + add + fill_null.
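The align-then-add idea can be sketched in plain Python, using dicts keyed by index label in place of series. This is only an illustration of the semantics described above, not pandas' or Polars' code; `aligned_add` is a hypothetical helper, and a label missing from one side is treated the same as a null value.

```python
from typing import Dict, Hashable, Optional


def aligned_add(
    left: Dict[Hashable, Optional[int]],
    right: Dict[Hashable, Optional[int]],
    fill_value: Optional[int] = None,
) -> Dict[Hashable, Optional[int]]:
    """Hypothetical model: align on the union of labels, then add."""
    out: Dict[Hashable, Optional[int]] = {}
    for label in sorted(left.keys() | right.keys()):
        x = left.get(label)   # None if the label is absent on this side
        y = right.get(label)
        if x is None and y is None:
            # Missing on both sides: the result stays missing.
            out[label] = None
        elif fill_value is None and (x is None or y is None):
            # No fill value: a single missing operand propagates.
            out[label] = None
        else:
            # Substitute fill_value for the one missing operand, if any.
            out[label] = (fill_value if x is None else x) + (
                fill_value if y is None else y
            )
    return out


print(aligned_add({0: 1, 1: 8}, {0: 4, 2: 3}, fill_value=0))
# {0: 5, 1: 8, 2: 3}
```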
I think that it might make sense for the vertical sum to differ from sum_horizontal, but I would have expected .add(), which adds two series, to behave the same as sum_horizontal (or the other way around).
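For reference, the three row-wise behaviors mentioned in this thread can be compared side by side on the example data (a = [1, 8, None], b = [4, None, None]). The sketch below is pure Python with hypothetical helper names; it models the semantics only and calls neither Polars nor pandas.

```python
from functools import reduce
from typing import List, Optional

Row = List[Optional[int]]


def add_propagate(x: Optional[int], y: Optional[int]) -> Optional[int]:
    """Model of `+` / .add(): any null operand yields null."""
    return None if x is None or y is None else x + y


def row_sum_propagate(row: Row) -> Optional[int]:
    # Repeated `+` across the row, starting from 0.
    return reduce(add_propagate, row, 0)


def row_sum_nulls_as_zero(row: Row) -> int:
    # Nulls are skipped, so an all-null row sums to 0.
    return sum(v for v in row if v is not None)


def row_sum_null_if_all_null(row: Row) -> Optional[int]:
    # Nulls are skipped unless the whole row is null.
    present = [v for v in row if v is not None]
    return sum(present) if present else None


rows: List[Row] = [[1, 4], [8, None], [None, None]]
print([row_sum_propagate(r) for r in rows])         # [5, None, None]
print([row_sum_nulls_as_zero(r) for r in rows])     # [5, 8, 0]
print([row_sum_null_if_all_null(r) for r in rows])  # [5, 8, None]
```

The first list corresponds to repeated + operations, the second to the behavior reported for pl.sum_horizontal, and the third to the pandas fill_value=0 result shown earlier.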
I really like the behavior @wukan1986 mentioned; it would be awesome if that could be added!
Checks
Reproducible example
Log output
No response
Issue description
I am not sure whether this is strictly a bug, as the behavior might be intended, but the current behavior is (to me at least) unintuitive. Using the .add() method (or the + operator) propagates nulls, i.e. None + 1 = None. Using pl.sum_horizontal() treats nulls as 0, i.e. None + None = 0.
Expected behavior
I would expect pl.sum_horizontal() and repeated + operations to return the same thing.
Installed versions