pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.89k stars 1.92k forks source link

Swapping rename without optimization produce incorrect schema. #10652

Open reswqa opened 1 year ago

reswqa commented 1 year ago

Checks

Reproducible example

        let lf = df!["a" => [1,2,3], "b" => [4,5,6]]?;
        let renamed = lf
            .lazy()
            .rename(["a", "b"], ["b", "a"])
            .select([col("a")])
            .without_optimizations();
        dbg!(&renamed.collect());

Issue description

It will touch the unexpected branch if panic_on_schema feature was activated(Of course, we only enable it for test env):

https://github.com/pola-rs/polars/blob/bc166cef669f0e783acae4dd6f703e8605659251/crates/polars-lazy/src/physical_plan/expressions/column.rs#L61-L75

This will not affect product env as it will fall back to the linear search branch in that case:

https://github.com/pola-rs/polars/blob/bc166cef669f0e783acae4dd6f703e8605659251/crates/polars-lazy/src/physical_plan/expressions/column.rs#L76-L80

The inconsistent order in the schema may originate from the following code:

https://github.com/pola-rs/polars/blob/bc166cef669f0e783acae4dd6f703e8605659251/crates/polars-plan/src/logical_plan/functions/rename.rs#L31-L39

Expected behavior

Panic path should not happen.

Installed versions

Replace this line with a list of feature gates
mcrumiller commented 1 year ago

Hi @reswqa, this is already fixed on main as per #10624.

reswqa commented 1 year ago

The difference is that optimization is not enabled here. Due to the projection pushdown rule, they will touch different branches.

ritchie46 commented 1 year ago

@reswqa good analysis :+1: . Care to take this one?

reswqa commented 1 year ago

Care to take this one?

Yes, I'll see if there's any way to fix it.

ritchie46 commented 1 year ago

I think that only in the case the swapping flag is set, we must do something different.

reswqa commented 1 year ago

Yes, I spent a short time thinking about it before, but I haven't come up with a way not to increase the time complexity yet, need to take a closer look.

reswqa commented 1 year ago

Perhaps for swapping a -> b, let's first perform a -> POLARS_RENAME_COLUMN_b and b -> POLARS_RENAME_COLUMN_a and handle it according to the non swap situation. Finally, proceed with POLARS_RENAME_COLUMN_xxx -> xxx 🤔 This should be all O(1) operations.