pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.14k stars 1.83k forks source link

Series.hist resulting series name changes with include_breakpoint=False, include_category=False #17279

Open ggggggggg opened 2 months ago

ggggggggg commented 2 months ago

Checks

Reproducible example

import polars as pl
import numpy as np
series = pl.Series([1,2,3,4,4,5,6,3,2,1]).rename("a")
h1 = series.hist(np.arange(10), include_category=False, include_breakpoint=False)
h2 = series.hist(np.arange(10), include_category=False, include_breakpoint=True)
print(f"{h1.columns=}")
print(f"{h2.columns=}")

Log output

No response

Issue description

The column containing the counts has different names in the two invocations of hist.

h1.columns=['a']
h2.columns=['break_point', 'count']

Expected behavior

I expect the column containing the count to be named "count" in both cases, or less preferably "a" in both cases. However it is called "a" when neither break_point not category are included.

Installed versions

``` --------Version info--------- Polars: 0.20.31 Index type: UInt32 Platform: Windows-10-10.0.19045-SP0 Python: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] ----Optional dependencies---- adbc_driver_manager: cloudpickle: connectorx: deltalake: fastexcel: fsspec: gevent: hvplot: matplotlib: 3.7.2 nest_asyncio: 1.5.7 numpy: 1.24.4 openpyxl: pandas: 2.1.0 pyarrow: 16.1.0 pydantic: pyiceberg: pyxlsb: sqlalchemy: 2.0.20 torch: xlsx2csv: xlsxwriter: ```
stinodego commented 2 months ago

This is a side-effect of porting the method from Expr to Series.

For Expr, the resulting expression is named "a" in all cases, but when include_breakpoint=True the return data type is a Struct and the fields are named breakpoint/count. Because we expand that struct to a dataframe in the Series method, the original name is lost.

I think the best solution would be to just follow Expr here and return a Struct Series, not a DataFrame. That's how cut/qcut work at the moment. What would you think of that solution?

hist is sort-of modeled after cut/qcut, but we aim to overhaul that API so that is not very helpful either.

ggggggggg commented 2 months ago

Returning a struct series seems fine to me.