pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Support "multi-hot" encoding of List columns in df.to_dummies #13733

Open Wainberg opened 10 months ago

Wainberg commented 10 months ago

Description

I'd like to implement "multi-hot" encoding for List columns in df.to_dummies, in which each row of the dummy dataframe has multiple elements turned on (one for each list element) instead of one. This is a common pre-processing step for statistical and machine learning applications.

So whereas this currently gives an error:

>>> import polars as pl
>>> pl.DataFrame({'a': [['A', 'B'], ['B'], ['A', 'C']]}).to_dummies()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "polars/dataframe/frame.py", line 9033, in to_dummies
    return self._from_pydf(self._df.to_dummies(columns, separator, drop_first))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: grouping on list type is only allowed if the inner type is numeric

It could instead give:

>>> pl.DataFrame({'a': [['A', 'B'], ['B'], ['A', 'C']]}).to_dummies()
shape: (3, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ u8  ┆ u8  ┆ u8  │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 0   │
│ 0   ┆ 1   ┆ 0   │
│ 1   ┆ 0   ┆ 1   │
└─────┴─────┴─────┘
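
To pin down the semantics being requested, here is a minimal, dependency-free sketch in plain Python. It is not how polars implements to_dummies, and the helper name multi_hot is hypothetical; it only illustrates the mapping from list rows to 0/1 columns, assuming columns are ordered by sorted category.

```python
def multi_hot(rows):
    """Return {category: [0 or 1 per row]}, with categories in sorted order.

    Hypothetical helper for illustration only; not part of polars.
    """
    # Collect every distinct element that appears in any row's list.
    categories = sorted({item for row in rows for item in row})
    # A row gets a 1 in a category's column if its list contains that category.
    return {
        cat: [1 if cat in row else 0 for row in rows]
        for cat in categories
    }


encoded = multi_hot([["A", "B"], ["B"], ["A", "C"]])
print(encoded)
```

Read column-wise, the result corresponds to the table above: a row can have several columns set to 1, one per list element.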

kszlim commented 10 months ago

You might want to consider contributing it to https://github.com/abstractqqq/polars_ds_extension

Wainberg commented 10 months ago

I was thinking it might be a better fit for polars itself because it's just about making an existing function "work" on new inputs (lists) in the obvious way.

reswqa commented 10 months ago

> in which each row of the dummy dataframe has multiple elements turned on (one for each list element) instead of one.

This is somewhat inconsistent with how df.to_dummies behaves for non-nested types. I admit that this is more valuable than treating each sub-list as a single element, but it seems we should put it in the ListNameSpace rather than break the consistency of to_dummies. Also, I'm not sure whether it belongs in polars or in a plugin; that needs to be discussed.

Wainberg commented 10 months ago

You're thinking Expr.list.to_dummies(), and then regular to_dummies still gives an error?

reswqa commented 10 months ago

Yes, I think regular to_dummies should raise for list types, at least for lists with a non-numeric inner type (it depends on group-by over lists, which we only support for numeric inner types). But this is only a temporary state: it will be well supported once we implement Row Encoding. list.to_dummies, on the other hand, obviously doesn't need to group the lists. I just want the behavior of regular to_dummies to be consistent for any type. Also, this is not final; we can still negotiate. :)
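
For contrast, the "consistent" behavior described here, where each sub-list counts as one category, can be sketched in plain Python as follows. This is an illustration under assumptions, not polars code: the helper name one_hot_whole_lists is hypothetical, and tuples stand in for whatever grouping key a real implementation would use.

```python
def one_hot_whole_lists(rows):
    """One-hot encode rows, treating each whole list as a single category.

    Hypothetical helper for illustration only; not part of polars.
    """
    # Lists are unhashable in Python, so convert each row to a tuple
    # before grouping on it.
    keys = [tuple(row) for row in rows]
    distinct = sorted(set(keys))
    # Exactly one column is set to 1 per row: the one matching the whole list.
    return {
        key: [1 if k == key else 0 for k in keys]
        for key in distinct
    }


whole = one_hot_whole_lists([["A", "B"], ["B"], ["A", "C"]])
print(whole)
```

Unlike the multi-hot variant, this version has to compare entire lists for equality, which is the step that corresponds to grouping on the list column.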