pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Inconsistent numeric column detection in lazy DataFrame after transformation #19226

Open lucianolorenti opened 3 weeks ago

lucianolorenti commented 3 weeks ago

Reproducible example

import polars as pl
import polars.selectors as cs

def get_numeric_columns(df):
    return df.select(cs.numeric()).collect_schema().names()

def get_columns(df):
    return df.collect_schema().names()

def process_column(df):
    columns = ["B", "C"]
    for c in columns:
        df = df.with_columns(
            pl.when(pl.col(c) < 0.3)
            .then(0.3)
            .otherwise(pl.col(c))
            .alias(c)
        )
    return df

df = pl.DataFrame(
    {
        "A": ["a", "b", "c", "d", "e"],
        "B": [1, 2, 3, 4, 5],
        "C": [0.2, 0.1, 0.3, 0.5, 0.4],
    }
)

df_lazy = df.lazy()

assert len(get_columns(df))  == 3
assert len(get_columns(df_lazy))  == 3

assert len(get_numeric_columns(df))  == 2
assert len(get_numeric_columns(df_lazy))  == 2

df = process_column(df)
df_lazy = process_column(df_lazy)

assert len(get_columns(df))  == 3
assert len(get_columns(df_lazy))  == 3

assert len(get_numeric_columns(df))  == 2
assert len(get_numeric_columns(df_lazy))  == 2

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[9], line 46
     42 assert len(get_columns(df_lazy))  == 3
     45 assert len(get_numeric_columns(df))  == 2
---> 46 assert len(get_numeric_columns(df_lazy))  == 2

AssertionError:

Log output

No response

Issue description

After performing a column transformation on a lazy DataFrame, selecting the numeric columns returns a different number of columns for the LazyFrame than for the eager DataFrame. I expect 2 numeric columns in both cases.

It seems that column B is not recognized as numeric, judging by the plan:

naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

 WITH_COLUMNS:
 [when([(col("C")) < (0.3)]).then(0.3).otherwise(col("C")).alias("C")]
   WITH_COLUMNS:
   [when([(col("B").cast(Unknown(Float))) < (dyn float: 0.3)]).then(dyn float: 0.3).otherwise(col("B").strict_cast(Unknown(Float))).alias("B")]
    DF ["A", "B", "C"]; PROJECT */3 COLUMNS; SELECTION: None

Expected behavior

After performing the column transformation, I was expecting to obtain the same columns in both the eager and the lazy DataFrame.

Maybe the current behaviour is the expected one, but I am not sure. I looked for a similar report without any luck.

Installed versions

```
--------Version info---------
Polars:              1.9.0
Index type:          UInt32
Platform:            Windows-10-10.0.22631-SP0
Python:              3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager
altair               5.0.1
cloudpickle          2.2.1
connectorx
deltalake
fastexcel
fsspec               2023.1.0
gevent
great_tables
matplotlib           3.8.1
nest_asyncio         1.5.6
numpy                1.24.0
openpyxl             3.1.2
pandas               1.5.3
pyarrow              11.0.0
pydantic             2.7.1
pyiceberg
sqlalchemy           1.4.46
torch                2.0.1+cpu
xlsx2csv
xlsxwriter           3.0.8
```

cmdlineluser commented 3 weeks ago

Can reproduce.

import polars as pl
import polars.selectors as cs

df = pl.LazyFrame({"A": [1]})

df.select(
    pl.when(pl.col.A < 1).then(pl.col.A).otherwise(2)   # <- OK: int / int
).select(cs.numeric()).collect_schema()
# Schema([('A', Int64)])

df.select(
    pl.when(pl.col.A < 1).then(pl.col.A).otherwise(2.0) # <- NOT OK: int / float
).select(cs.numeric()).collect_schema()
# Schema()

The Schema says it's a Float64:

df.select(
    pl.when(pl.col.A < 1).then(pl.col.A).otherwise(2.0)
).collect_schema()
# Schema([('A', Float64)])

But even a regular dtype selection does not work:

df.select(
    pl.when(pl.col.A < 1).then(pl.col.A).otherwise(2.0)
).select(pl.col(pl.Float64)).collect()
# shape: (0, 0)
# ┌┐
# ╞╡
# └┘
ritchie46 commented 3 weeks ago

Ah, selectors don't recognize dynamic numerics I think.