pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Set bit-size of int dtype of .hash() #4111

Open thomasaarholt opened 2 years ago

thomasaarholt commented 2 years ago

I'm currently using hashing in order to int-encode categorical variables (strings) in my dataframe to use in gradient boosting models. In one column I have a few million unique strings, and uint32 is an appropriate hash-encoding for this case. In another case I have a few simple strings, like 1->2 up to 8->8 (64 possible combinations). Here it would be nice if I could limit the hashed datatype to uint8 so as to save memory. An optional argument for dtype could be passed to the python method:

pl.col("a").hash(dtype=dtype), with a default of dtype=pl.UInt32, would be great.

thomasaarholt commented 2 years ago

I will add that it's currently a bit difficult to use sklearn-type encoding algorithms with polars:

from sklearn.preprocessing import OrdinalEncoder
import polars as pl
import numpy as np

df = pl.DataFrame({"to_encode": ["a", "b", None], "some_value": [1, 2, 3]})

enc = OrdinalEncoder(dtype=np.int8)
df_encode = df.select("to_encode")
data = df_encode.to_numpy()
transformed = enc.fit_transform(data)

df_transformed = pl.DataFrame(
    {col + "2": transformed[:, i] for i, col in enumerate(df_encode.columns)}
)
df.hstack(df_transformed)

shape: (3, 3)
┌───────────┬────────────┬────────────┐
│ to_encode ┆ some_value ┆ to_encode2 │
│ ---       ┆ ---        ┆ ---        │
│ str       ┆ i64        ┆ i8         │
╞═══════════╪════════════╪════════════╡
│ a         ┆ 1          ┆ 0          │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b         ┆ 2          ┆ 1          │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null      ┆ 3          ┆ 2          │
└───────────┴────────────┴────────────┘
ghuls commented 2 years ago

You can use:

pl.col("to_encode").cast(pl.Categorical).to_physical().cast(pl.UInt8)
ritchie46 commented 2 years ago

@ghuls answer is perfect. Our local categoricals are exactly ordinal encodings.

thomasaarholt commented 2 years ago

This is super neat! Thank you!

thomasaarholt commented 2 years ago

The only issue I have here is that the encoded value is dependent on the order in which it appears in the dataset - while the hashed value is always reproducible.

Imagine if I first treat my "train" dataset in the following way. Here "albert" is encoded to 0. If I then treat my test dataset later, "albert" is encoded as 2, and will be treated differently by the next ML steps.

# Train
train = pl.DataFrame(
    {"to_encode": ["albert", "boris", "charles"]}
)
train.with_columns(
    [
        pl.col("to_encode").hash().alias("hashed"),
        pl.col("to_encode")
        .cast(pl.Categorical)
        .to_physical()
        .cast(pl.UInt8)
        .alias("encoded"),
    ]
).filter(pl.col("to_encode") == "albert")  # albert is encoded as 0 here

# Test
test = pl.DataFrame({"to_encode": ["daniel", "boris", "albert"]})
test.with_columns(
    [
        pl.col("to_encode").hash().alias("hashed"),
        pl.col("to_encode")
        .cast(pl.Categorical)
        .to_physical()
        .cast(pl.UInt8)
        .alias("encoded"),
    ]
).filter(pl.col("to_encode") == "albert")  # albert is encoded as 2 here

I could use a string cache to fix this, but I can't see a way of saving the string cache between sessions?

thomasaarholt commented 2 years ago

String cache solution, but I have to treat the train and test dataset in the same python session:

with pl.StringCache():

    train2 = train.with_columns(
        [
            pl.col("to_encode").hash().alias("hashed"),
            pl.col("to_encode")
            .cast(pl.Categorical)
            .to_physical()
            .cast(pl.UInt8)
            .alias("encoded"),
        ]
    )

    test2 = test.with_columns(
        [
            pl.col("to_encode").hash().alias("hashed"),
            pl.col("to_encode")
            .cast(pl.Categorical)
            .to_physical()
            .cast(pl.UInt8)
            .alias("encoded"),
        ]
    )

train2.filter(pl.col("to_encode") == "albert").vstack(test2.filter(pl.col("to_encode") == "albert"))

shape: (2, 3)
┌───────────┬─────────────────────┬─────────┐
│ to_encode ┆ hashed              ┆ encoded │
│ ---       ┆ ---                 ┆ ---     │
│ str       ┆ u64                 ┆ u8      │
╞═══════════╪═════════════════════╪═════════╡
│ albert    ┆ 2405933837371676327 ┆ 0       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ albert    ┆ 2405933837371676327 ┆ 0       │
└───────────┴─────────────────────┴─────────┘
ritchie46 commented 2 years ago

I think it makes sense to allow a seeded StringCache.

ritchie46 commented 2 years ago

Now that I think of it, that won't help. There is contention, and the string cache is also dependent on the order.

A hash to any dtype exists, but may/will have duplicates. E.g. hashing to UInt8 is (pl.col(..).hash() % 256).cast(pl.UInt8) (256 values in u8).

ghuls commented 2 years ago

I think we need something like unify_dictionaries in pyarrow, which would allow to "fix" local string caches so both will encode the same string as the same integer: https://arrow.apache.org/docs/python/generated/pyarrow.ChunkedArray.html?highlight=unify_dictionaries#pyarrow.ChunkedArray.unify_dictionaries

ritchie46 commented 2 years ago

We actually have that; we use it when appending. However, this still does not make the categoricals independent of the discovery order.

I am inclined to think that you must first sort them, compute the categories, and then unsort them again.