pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

to_dummies() as an Expr #9308

Open bernardodionisi opened 1 year ago

bernardodionisi commented 1 year ago

Problem description

Hi!

I was wondering if it would be possible to add .to_dummies() to the expression API.

One application of this feature would be defining design matrices before materializing them. I think the expression API could provide something close to an alternative to a Wilkinson formula.

For example, "y ~ a + i(b)" (where i creates dummies for the categories in b) could be expressed as [pl.col("y"), pl.col("a"), pl.col("b").to_dummies()]

One would then also need a way to drop one of the dummies, to avoid multicollinearity issues.

Maybe this would not be possible given that .to_dummies() returns a DataFrame.

There are so many great features in the library, I hope I didn't miss a way to achieve this already.

Thank you for this amazing library!

mcrumiller commented 1 year ago

One of the issues is that an expression outputs a single column, so this isn't really feasible: if a column has N distinct values, then it produces N or N-1 dummy columns.

bernardodionisi commented 1 year ago

Couldn't the dummies be returned as a single list or struct column? Maybe not ideal though, and it would probably need a different name than to_dummies.

mcrumiller commented 1 year ago

Yeah, good point--a struct would work.

magarick commented 1 year ago

The other option, which might be simpler, is adding a "keep" argument to the existing DataFrame function that would retain some columns as-is in the output. Maybe it should be a separate function, "design_matrix" or something, but same idea. I don't think it would make unnecessary copies even if you just concatenated the dummy df with the as-is columns.

yanis-falaki commented 3 weeks ago

Commenting to boost.