pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Keep order of categories/strings when converting to dummies #12248

Open matquant14 opened 11 months ago

matquant14 commented 11 months ago

Description

I have a DataFrame with 10 categories. I want to one-hot encode them, producing a 10x10 matrix, but the to_dummies function appears to sort the resulting columns alphabetically. I'd like to keep the original order, essentially creating a 10x10 identity matrix whose column names follow the original order of the categories. Here's an example of what I'm experiencing:

import polars as pl

categories = ['Ethical and Professional Standards', 'Quantitative Methods', 'Economics', 'Financial Statement Analysis',
              'Corporate Issuers', 'Portfolio Management', 'Equity Investments', 'Fixed Income', 'Derivatives',
              'Alternative Investments']

# One-hot encode the Categorical column, then strip the 'categories:' prefix from the column names
cat_df = pl.DataFrame({'categories': categories}, schema={'categories': pl.Categorical})
cat_to_dum = cat_df.to_dummies(separator=':')
cat_to_dum.rename({col: col.replace('categories:', '') for col in cat_to_dum.columns})

+-------------------------+-------------------+-------------+-----------+--------------------+------------------------------------+------------------------------+--------------+----------------------+----------------------+
| Alternative Investments | Corporate Issuers | Derivatives | Economics | Equity Investments | Ethical and Professional Standards | Financial Statement Analysis | Fixed Income | Portfolio Management | Quantitative Methods |
+-------------------------+-------------------+-------------+-----------+--------------------+------------------------------------+------------------------------+--------------+----------------------+----------------------+
|            0            |         0         |      0      |     0     |         0          |                 1                  |              0               |      0       |          0           |          0           |
|            0            |         0         |      0      |     0     |         0          |                 0                  |              0               |      0       |          0           |          1           |
|            0            |         0         |      0      |     1     |         0          |                 0                  |              0               |      0       |          0           |          0           |
|            0            |         0         |      0      |     0     |         0          |                 0                  |              1               |      0       |          0           |          0           |
|            0            |         1         |      0      |     0     |         0          |                 0                  |              0               |      0       |          0           |          0           |
|            0            |         0         |      0      |     0     |         0          |                 0                  |              0               |      0       |          1           |          0           |
|            0            |         0         |      0      |     0     |         1          |                 0                  |              0               |      0       |          0           |          0           |
|            0            |         0         |      0      |     0     |         0          |                 0                  |              0               |      1       |          0           |          0           |
|            0            |         0         |      1      |     0     |         0          |                 0                  |              0               |      0       |          0           |          0           |
|            1            |         0         |      0      |     0     |         0          |                 0                  |              0               |      0       |          0           |          0           |
+-------------------------+-------------------+-------------+-----------+--------------------+------------------------------------+------------------------------+--------------+----------------------+----------------------+

To get what I want, I have to add a select expression that puts the columns back in the original order:

cat_to_dum.rename({col: col.replace('categories:', '') for col in cat_to_dum.columns}).select(pl.col(categories))

+------------------------------------+----------------------+-----------+------------------------------+-------------------+----------------------+--------------------+--------------+-------------+-------------------------+
| Ethical and Professional Standards | Quantitative Methods | Economics | Financial Statement Analysis | Corporate Issuers | Portfolio Management | Equity Investments | Fixed Income | Derivatives | Alternative Investments |
+------------------------------------+----------------------+-----------+------------------------------+-------------------+----------------------+--------------------+--------------+-------------+-------------------------+
|                 1                  |          0           |     0     |              0               |         0         |          0           |         0          |      0       |      0      |            0            |
|                 0                  |          1           |     0     |              0               |         0         |          0           |         0          |      0       |      0      |            0            |
|                 0                  |          0           |     1     |              0               |         0         |          0           |         0          |      0       |      0      |            0            |
|                 0                  |          0           |     0     |              1               |         0         |          0           |         0          |      0       |      0      |            0            |
|                 0                  |          0           |     0     |              0               |         1         |          0           |         0          |      0       |      0      |            0            |
|                 0                  |          0           |     0     |              0               |         0         |          1           |         0          |      0       |      0      |            0            |
|                 0                  |          0           |     0     |              0               |         0         |          0           |         1          |      0       |      0      |            0            |
|                 0                  |          0           |     0     |              0               |         0         |          0           |         0          |      1       |      0      |            0            |
|                 0                  |          0           |     0     |              0               |         0         |          0           |         0          |      0       |      1      |            0            |
|                 0                  |          0           |     0     |              0               |         0         |          0           |         0          |      0       |      0      |            1            |
+------------------------------------+----------------------+-----------+------------------------------+-------------------+----------------------+--------------------+--------------+-------------+-------------------------+

Can an argument be added to the to_dummies function that maintains the category (or string) order?
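
Until such an argument exists, the workaround above could be wrapped in a small helper. This is only a sketch under a few assumptions: the names `ordered_dummy_names` and `to_dummies_in_order` are hypothetical (they are not part of Polars), the column contains no nulls, and dummy columns are named `<column><separator><value>`, which is how the output above is named.

```python
from typing import TYPE_CHECKING, Iterable

if TYPE_CHECKING:  # polars is only needed for the type hints below
    import polars as pl


def ordered_dummy_names(values: Iterable[str], prefix: str, separator: str = "_") -> list[str]:
    """Hypothetical helper: dummy column names in order of first appearance."""
    # dict.fromkeys drops duplicates while keeping first-appearance order
    return [f"{prefix}{separator}{value}" for value in dict.fromkeys(values)]


def to_dummies_in_order(df: "pl.DataFrame", column: str, separator: str = "_") -> "pl.DataFrame":
    """Hypothetical helper: one-hot encode `column`, keeping first-appearance order.

    Assumes the column has no nulls.
    """
    names = ordered_dummy_names(df.get_column(column).to_list(), column, separator)
    # to_dummies may return sorted columns; select() puts them back in order
    return df.select(column).to_dummies(separator=separator).select(names)
```

With the frame from the example, `to_dummies_in_order(cat_df, 'categories', separator=':')` should return the columns in the order of the `categories` list, still carrying the `categories:` prefix.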

cmdlineluser commented 11 months ago

It looks like there is an explicit sort_by in the implementation, so the columns come out sorted:

pl.Series(list("defza")).to_dummies(separator="").columns
# ['a', 'd', 'e', 'f', 'z']

https://github.com/pola-rs/polars/blob/4a58499bdf97203c940a3c82c1113905e8c6087d/crates/polars-ops/src/series/ops/to_dummies.rs#L51

https://github.com/pola-rs/polars/blob/4a58499bdf97203c940a3c82c1113905e8c6087d/crates/polars-ops/src/series/ops/to_dummies.rs#L82-L85

Is sorting by default the more expected behavior when using .get_dummies()?

Similar options already seem to exist elsewhere: maintain_order=True|False (group_by) and sort=True|False (value_counts).