pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

polars.to_dummies() does not allow for a proper number of dummy variables #8046

Open kcpGOAT opened 1 year ago

kcpGOAT commented 1 year ago

Problem description

Normally, when encoding a categorical variable, one creates k - 1 dummy variables for k levels. Dropping one level avoids perfect multicollinearity when building a regression model with an intercept, since the k dummy columns would otherwise always sum to 1. For instance, if the variable were sex and the levels were male and female, a single "sex" indicator column would be enough.

In pandas, this feature is implemented with the "drop_first" parameter of the pandas.get_dummies() function. However, I see no such parameter in the Polars equivalent, DataFrame.to_dummies().
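
For reference, here is a minimal sketch of the pandas behavior I mean. The frame and its values are made up purely for illustration; the only real API used is pandas.get_dummies() with drop_first=True.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({"sex": ["male", "female", "female"]})

# drop_first=True removes the first level (in sorted order for string
# columns), leaving k - 1 indicator columns for k levels.
encoded = pd.get_dummies(df, columns=["sex"], drop_first=True)

print(list(encoded.columns))
```

With two levels, only one indicator column remains, and "female" becomes the implicit reference level.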

mhoirup commented 1 year ago

You could just drop the unnecessary column after calling to_dummies(). This would actually give you more direct control over which level serves as the reference (control) group in an inference context.