Open Wainberg opened 10 months ago
You might want to consider contributing it to https://github.com/abstractqqq/polars_ds_extension
I was thinking it might be a better fit for polars itself because it's just about making an existing function "work" on new inputs (lists) in the obvious way.
in which each row of the dummy dataframe has multiple elements turned on (one for each list element) instead of one.
This is a bit inconsistent with the behavior of df.to_dummies for non-nested types. I admit that this is more valuable than treating each sub-list as a single element, but it seems that we should let it belong to the ListNameSpace rather than breaking the consistency of to_dummies. Also, I'm not sure whether to include it in polars or in a plugin; that needs to be discussed.
You're thinking Expr.list.to_dummies(), and then regular to_dummies still gives an error?
Yes, I think regular to_dummies should raise for list types, at least for lists with a non-numeric inner type (it depends on group-by on list columns, which we only support for numeric inner types). But this is only a temporary state, and it will be well supported once we implement Row Encoding. list.to_dummies, on the other hand, obviously doesn't need to group the list. I just want the behavior of regular to_dummies to be consistent for any type. Also, this is not final; we can still negotiate. :)
Description
I'd like to implement "multi-hot" encoding for List columns in df.to_dummies, in which each row of the dummy dataframe has multiple elements turned on (one for each list element) instead of one. This is a common pre-processing step for statistical and machine learning applications.
So whereas this currently gives an error:
It could instead give:
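As a rough illustration of the multi-hot semantics described above, here is a minimal plain-Python sketch. It does not use polars, and it is not the proposed implementation: the helper name `multi_hot`, the sample column name `tags`, and the sorted column order are assumptions made for illustration. The `column_value` naming with a `_` separator is modeled on how to_dummies names its output columns.

```python
def multi_hot(column_name, rows, separator="_"):
    """Return {dummy_column_name: [0/1 per row]} for a column of lists.

    Hypothetical helper for illustration only; not part of polars.
    """
    # Collect every distinct element across all rows; sort for a stable column order.
    categories = sorted({item for row in rows for item in row})
    # One dummy column per distinct element; a row gets 1 if its list contains it.
    return {
        f"{column_name}{separator}{cat}": [int(cat in row) for row in rows]
        for cat in categories
    }


rows = [["a", "b"], ["b"], [], ["a", "c"]]
encoded = multi_hot("tags", rows)
for name, flags in encoded.items():
    print(name, flags)
```

Unlike one-hot encoding, a row here can have several columns set to 1 (the first row sets both `tags_a` and `tags_b`), and an empty list yields a row of all zeros.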