thomasaarholt opened this issue 2 years ago
I will add that it's currently a bit difficult to use sklearn-type encoding algorithms with polars:
from sklearn.preprocessing import OrdinalEncoder
import polars as pl
import numpy as np
df = pl.DataFrame({"to_encode": ["a", "b", None], "some_value": [1, 2, 3]})
enc = OrdinalEncoder(dtype=np.int8)
df_encode = df.select("to_encode")
data = df_encode.to_numpy()
transformed = enc.fit_transform(data)
df_transformed = pl.DataFrame(
{col + "2": transformed[:, i] for i, col in enumerate(df_encode.columns)}
)
df.hstack(df_transformed)
shape: (3, 3)
┌───────────┬────────────┬────────────┐
│ to_encode ┆ some_value ┆ to_encode2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i8 │
╞═══════════╪════════════╪════════════╡
│ a ┆ 1 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ 3 ┆ 2 │
└───────────┴────────────┴────────────┘
You can use:
pl.col("to_encode").cast(pl.Categorical).to_physical().cast(pl.UInt8)
@ghuls answer is perfect. Our local categoricals are exactly ordinal encodings.
This is super neat! Thank you!
The only issue I have here is that the encoded value depends on the order in which it appears in the dataset, while the hashed value is always reproducible.
Imagine that I first treat my "train" dataset in the following way. Here "albert" is encoded as 0. If I then treat my test dataset later, "albert" is encoded as 2, and will be treated differently by the next ML steps.
# Train
train = pl.DataFrame(
{"to_encode": ["albert", "boris", "charles"]}
)
train.with_columns(
[
pl.col("to_encode").hash().alias("hashed"),
pl.col("to_encode")
.cast(pl.Categorical)
.to_physical()
.cast(pl.UInt8)
.alias("encoded"),
]
).filter(pl.col("to_encode") == "albert") # albert is encoded as 0 here
# Test
test = pl.DataFrame({"to_encode": ["daniel", "boris", "albert"]})
test.with_columns(
[
pl.col("to_encode").hash().alias("hashed"),
pl.col("to_encode")
.cast(pl.Categorical)
.to_physical()
.cast(pl.UInt8)
.alias("encoded"),
]
).filter(pl.col("to_encode") == "albert") # albert is encoded as 2 here
I could use a string cache to fix this, but I can't see a way to save the string cache between sessions.
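One workaround (not a polars feature, just a sketch with hypothetical helper names) is to persist the category-to-code mapping yourself, e.g. as JSON, and re-apply it in a later session so the same string always gets the same code:

```python
import json

def fit_categories(values):
    """Assign each distinct string an integer code in first-seen order."""
    mapping = {}
    for v in values:
        if v is not None and v not in mapping:
            mapping[v] = len(mapping)
    return mapping

def encode(values, mapping):
    """Encode with a previously saved mapping; unseen strings become None."""
    return [mapping.get(v) for v in values]

# "Session 1": fit on the train data and serialize the mapping.
train_mapping = fit_categories(["albert", "boris", "charles"])
serialized = json.dumps(train_mapping)  # write this string to a file on disk

# "Session 2": load the mapping and encode the test data consistently.
loaded = json.loads(serialized)
print(encode(["daniel", "boris", "albert"], loaded))  # [None, 1, 0]
```

"albert" is encoded as 0 in both sessions; the unseen "daniel" comes back as None, which you would have to handle explicitly (as sklearn's `handle_unknown` does).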
String cache solution, but I have to treat the train and test dataset in the same Python session:
with pl.StringCache():
train2 = train.with_columns(
[
pl.col("to_encode").hash().alias("hashed"),
pl.col("to_encode")
.cast(pl.Categorical)
.to_physical()
.cast(pl.UInt8)
.alias("encoded"),
]
)
test2 = test.with_columns(
[
pl.col("to_encode").hash().alias("hashed"),
pl.col("to_encode")
.cast(pl.Categorical)
.to_physical()
.cast(pl.UInt8)
.alias("encoded"),
]
)
train2.filter(pl.col("to_encode") == "albert").vstack(test2.filter(pl.col("to_encode") == "albert"))
shape: (2, 3)
┌───────────┬─────────────────────┬─────────┐
│ to_encode ┆ hashed ┆ encoded │
│ --- ┆ --- ┆ --- │
│ str ┆ u64 ┆ u8 │
╞═══════════╪═════════════════════╪═════════╡
│ albert ┆ 2405933837371676327 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ albert ┆ 2405933837371676327 ┆ 0 │
└───────────┴─────────────────────┴─────────┘
I think it makes sense to allow a seeded StringCache.
Now that I think of it, that won't help. There is contention, and the string cache is also dependent on the order.
A hash to any dtype exists, but may/will have duplicates. E.g. hashing to UInt8 is (pl.col(..).hash() % 256).cast(pl.UInt8) (256 values in u8).
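To make the collision point concrete, here is a small stdlib-only illustration; blake2b stands in for polars' internal hash (it is not the same function), and the 8-bit reduction works the same way:

```python
import hashlib

def hash_u8(s: str) -> int:
    # Stable 64-bit hash of the string, reduced to the 256 values of a u8.
    h = int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")
    return h % 256

names = [f"name{i}" for i in range(300)]
codes = {n: hash_u8(n) for n in names}
# 300 distinct strings into 256 buckets: collisions are guaranteed by the
# pigeonhole principle, and likely long before that (birthday bound: a
# collision is probable after roughly sqrt(256) = 16 distinct strings).
print(len(set(codes.values())) < len(names))  # True: some strings collided
```

This is why a small hash dtype only makes sense when the number of distinct strings is well below the dtype's range, as in the 64-combination case above.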
I think we need something like unify_dictionaries in pyarrow, which would allow "fixing" local string caches so that both will encode the same string as the same integer:
https://arrow.apache.org/docs/python/generated/pyarrow.ChunkedArray.html?highlight=unify_dictionaries#pyarrow.ChunkedArray.unify_dictionaries
We actually have that. We use it when appending. However, this still does not solve the fact that the categoricals are dependent on the discovery order.
I am inclined to think that you must first sort them, compute the categories, and then unsort them again.
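A plain-Python sketch of that idea: sort the distinct values, assign codes by sorted position, then map each row back, so the code no longer depends on the order rows appear in:

```python
def ordinal_encode_sorted(values):
    # Codes follow the sorted order of the distinct categories, so any row
    # ordering of `values` yields the same code for the same string.
    categories = sorted({v for v in values if v is not None})
    code = {c: i for i, c in enumerate(categories)}
    return [code.get(v) for v in values]

print(ordinal_encode_sorted(["charles", "albert", "boris"]))  # [2, 0, 1]
print(ordinal_encode_sorted(["albert", "boris", "charles"]))  # [0, 1, 2]
```

Note this only removes the discovery-order dependence within one frame: a test frame containing a new category that sorts early (say "aaron") would still shift the codes, so across datasets you additionally need a shared or saved category set.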
I'm currently using hashing to int-encode categorical variables (strings) in my dataframe for use in gradient boosting models. In one column I have a few million unique strings, and uint32 is an appropriate hash-encoding for this case. In another case I have a few simple strings, like "1->2" up to "8->8" (64 possible combinations). Here it would be nice if I could limit the hashed datatype to uint8 so as to save memory. An optional dtype argument passed to the Python method, pl.col("a").hash(dtype=dtype) with default dtype=pl.UInt64, would be great.