pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.23k stars 1.95k forks source link

pl.sum_horizontal handles nulls differently from .add() #16200

Open CangyuanLi opened 5 months ago

CangyuanLi commented 5 months ago

Checks

Reproducible example

df = pl.DataFrame(
    {
        "a": [1, 8, None],
        "b": [4, None, None],
    }
)
df.with_columns(
    pl.sum_horizontal("a", "b").alias("sum_horizontal"),
    pl.col("a").add(pl.col("b")).alias("add")
)

shape: (3, 4)
┌──────┬──────┬────────────────┬──────┐
│ a    ┆ b    ┆ sum_horizontal ┆ add  │
│ ---  ┆ ---  ┆ ---            ┆ ---  │
│ i64  ┆ i64  ┆ i64            ┆ i64  │
╞══════╪══════╪════════════════╪══════╡
│ 1    ┆ 4    ┆ 5              ┆ 5    │
│ 8    ┆ null ┆ 8              ┆ null │
│ null ┆ null ┆ 0              ┆ null │
└──────┴──────┴────────────────┴──────┘

Log output

No response

Issue description

I am not sure if this strictly as a bug, as the behavior might be intended, but the current behavior is (to me at least) unintuitive. Using the .add() method (or the + operator) propagates nulls, i.e. None + 1 = None. Using pl.sum_horizontal() treats nulls as 0, i.e. None + None = 0.

Expected behavior

I would expect pl.sum_horizontal() and repeated + operations to return the same thing.

Installed versions

``` --------Version info--------- Polars: 0.20.25 Index type: UInt32 Platform: Linux-4.18.0-372.71.1.el8_6.x86_64-x86_64-with-glibc2.28 Python: 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0] ----Optional dependencies---- adbc_driver_manager: cloudpickle: 2.1.0 connectorx: deltalake: fastexcel: fsspec: 2021.10.1 gevent: hvplot: matplotlib: 3.8.4 nest_asyncio: 1.5.5 numpy: 1.23.1 openpyxl: 3.0.10 pandas: 2.2.2 pyarrow: 16.0.0 pydantic: 2.7.1 pyiceberg: pyxlsb: sqlalchemy: 1.4.39 torch: 1.12.1 xlsx2csv: xlsxwriter: ```
wukan1986 commented 5 months ago

I believe Pandas' approach is more appropriate. If a row is entirely None, it should return None. If you wish to achieve the same result as the + operator, you can set fill_value=None.

import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 8, None],
        "b": [4, None, None],
    }
)
df['add'] = df["a"].add(df["b"], fill_value=0)
print(df)
"""
     a    b  add
0  1.0  4.0  5.0
1  8.0  NaN  8.0
2  NaN  NaN  NaN
"""
df['add'] = df["a"].add(df["b"], fill_value=None)
print(df)
"""
     a    b  add
0  1.0  4.0  5.0
1  8.0  NaN  NaN
2  NaN  NaN  NaN
"""
s-banach commented 5 months ago

In the case of sum as an aggregation function, the idea is that the sum of no elements is 0. In particular, if all your elements are null, the sum of the non-null elements is 0. For example

The question is whether sum_horizontal should have the same behavior as the vertical sum.

EDIT: Of course aggregation functions like sum should always have the argument ignore_nulls. As soon as I saw cmdlineuser's comment, this became clear.

cmdlineluser commented 5 months ago

The frame method has ignore_nulls

>>> df.sum_horizontal(ignore_nulls=False)
shape: (3,)
Series: 'sum' [i64]
[
    5
    null
    null
]

But pl.sum_horizontal doesn't allow you to specify that:

https://github.com/pola-rs/polars/blob/07d58ab4feb9c2a9b4fcb84d0b5808f87fc90a3f/crates/polars-ops/src/series/ops/horizontal.rs#L18

mcrumiller commented 5 months ago

Pandas add operates differently, as it aligns columns first, similar to polars' df.join(how="align"). See #9804 for a little more detail (also requested in #10390). I had a PR a long time ago (#9805) that was ultimated rejected in favor of performing the join + add + fill_null.

CangyuanLi commented 5 months ago

In the case of sum as an aggregation function, the idea is that the sum of no elements is 0. In particular, if all your elements are null, the sum of the non-null elements is 0. For example

  • pl.Series([], dtype=pl.Int32).sum() is 0.
  • pl.Series([None], dtype=pl.Int32).sum() is 0.

The question is whether sum_horizontal should have the same behavior as the vertical sum.

EDIT: Of course aggregation functions like sum should always have the argument ignore_nulls. As soon as I saw cmdlineuser's comment, this become clear.

I think that it might make sense for the vertical sum to be different from sum_horizontal, but I would have expected for .add(), which adds two series, to have the same behavior as sum_horizontal (or the other way around).

I really like the behavior @wukan1986 mentioned; it would be awesome if that could be added!