pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

The result of drop_first is not unique #14832

Open wukan1986 opened 6 months ago

wukan1986 commented 6 months ago

Reproducible example

import polars as pl

df = pl.DataFrame(
    {
        "foo": [1, 2],
        "bar": [3, 4],
        "ham": ["a", "b"],
    }
)
df1 = df.to_dummies('foo', drop_first=True)
df2 = df.reverse().to_dummies('foo', drop_first=True)
print(df1.columns)
print(df2.columns)
"""
['foo_2', 'bar', 'ham']
['foo_1', 'bar', 'ham']
"""

Log output

No response

Issue description

The result of `drop_first` is not unique: which dummy column is dropped depends on the order of the rows.

For a large table, the columns obtained may differ from run to run even when no changes are made to the table.

Expected behavior

I would expect the category dropped by `drop_first` to be the same regardless of row order.

Installed versions

```
--------Version info---------
Polars: 0.20.13
Index type: UInt32
Platform: Windows-10-10.0.22631-SP0
Python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:29:11) [MSC v.1935 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle: 3.0.0
connectorx:
deltalake:
fsspec: 2023.10.0
gevent:
hvplot: 0.9.1
matplotlib: 3.8.2
numpy: 1.26.1
openpyxl:
pandas: 2.1.3
pyarrow: 14.0.1
pydantic: 2.6.0
pyiceberg:
pyxlsb:
sqlalchemy: 2.0.23
xlsx2csv:
xlsxwriter:
```

s-banach commented 6 months ago

Definitely not a bug for general dtypes, but maybe drop_first should be guaranteed to drop the first level of an Enum.

Also, maybe `to_dummies()` on an Enum column should create a column for every level of the Enum, even for levels that are not present in the dataset. That way `to_dummies()` would always produce the same output schema, regardless of the input data.

wukan1986 commented 6 months ago

Maybe sort the new columns alphabetically and remove the first one