Open bernardodionisi opened 1 year ago
One of the issues is that an expression outputs a single column, so this isn't really feasible: if a column has N distinct values, then it has N or N-1 dummy columns.
Couldn't the dummies be returned as a single list or struct column? Maybe not ideal though, and it would probably need a different name than `to_dummies`.
Yeah, good point--a struct would work.
The other option, that might be simpler, is adding a "keep" argument to the existing DataFrame function which would retain some columns as-is in the output. Maybe it should be a different function "design_matrix" or something, but same idea. I don't think it would make unnecessary copies even if you just concatenated the dummy df with as-is columns.
Commenting to boost.
Problem description
Hi!
I was wondering if it would be possible to add `.to_dummies()` to the expression API.
An application of this feature would be defining design matrices before materializing them. I think the expression API can provide something close to an alternative to a Wilkinson formula.
For example, `"y ~ a + i(b)"` (where `i` creates dummies for the categories in `b`) could be expressed as `[pl.col("y"), pl.col("a"), pl.col("b").to_dummies()]`.
One would then also need a way to drop one of the dummies, to avoid multicollinearity issues.
Maybe this would not be possible given that `.to_dummies()` returns a `DataFrame`.
There are so many great features in the library; I hope I didn't miss a way to achieve this already.
Thank you for this amazing library!