scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Hash Encoding Outputting High Collision Output #402

Open eddietaylor opened 1 year ago

eddietaylor commented 1 year ago

Expected Behavior

If I set n_components to 8, for example, I should be able to encode columns with up to 2^8 values.

Actual Behavior

Whatever I set n_components to ends up being the number of unique rows in the encoded output. For example, if I set n_components to 32, there are only 32 unique rows. This obviously will not do as an encoder, since there are countless collisions.

This issue does not come up with binary encoding, for example: I can successfully encode my high-cardinality categorical column with it.
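For comparison, a minimal sketch of that binary-encoding behaviour (random, illustrative data; the column name col_a is arbitrary):

from category_encoders import BinaryEncoder
import pandas as pd
import random

categories = [str(x) for x in range(100)]
df = pd.DataFrame({"col_a": [random.choice(categories) for _ in range(1000)]})

# BinaryEncoder gives each of the ~100 categories its own bit pattern,
# so the number of distinct encoded rows matches the number of categories.
enc = BinaryEncoder(cols=["col_a"])
df_out = enc.fit_transform(df)
print(df_out.drop_duplicates().shape[0])  # close to 100: one bit pattern per category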

Steps to Reproduce the Problem

  1. Fit the transformer on some high-cardinality categorical column.
  2. Look at the number of unique rows across the col_0, ..., col_7 columns.

Specifications

PaulWestenthanner commented 1 year ago

Hi Eddie,

thanks for reporting this issue. I've created this example to reproduce the problem:

from category_encoders import HashingEncoder
import pandas as pd
import random

categories = [str(x) for x in range(100)]
data = [random.choice(categories) for _ in range(1000)]
df = pd.DataFrame({"col_a": data})

my_enc = HashingEncoder()  # default n_components=8
df_out = my_enc.fit_transform(df)
df_out["dummy"] = 1
# groups by the full (col_0, ..., col_7) row; at most 8 distinct combinations appear
print(df_out.groupby([f"col_{i}" for i in range(8)]).count())

I'm really surprised that this seems to have gone unnoticed for 7 years. It seems like we need to modify this line to basically do some logarithmic action, and we need to join the strings of the whole row before hashing. https://github.com/scikit-learn-contrib/category_encoders/blob/f6349a140c8477b612a63c7d8f5cfe21139f5989/category_encoders/hashing.py#L308

Also note a similar question and discussion in #350

bmreiniger commented 1 year ago

This is expected behavior for a hashing encoder, I think. See Wikipedia, and compare with sklearn's implementation.
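For reference, a minimal sketch of the sklearn hashing trick with FeatureHasher (sklearn's default n_features is 2**20; it's reduced here so the dense output stays small):

import random
from sklearn.feature_extraction import FeatureHasher

data = [str(random.randrange(100)) for _ in range(1000)]

# Each category is hashed into one of n_features buckets (with a sign),
# so the number of distinguishable categories is governed by n_features.
hasher = FeatureHasher(n_features=2 ** 8, input_type="string")
X = hasher.transform([[value] for value in data])
print(X.shape)  # (1000, 256), sparse
# close to 100 distinct row patterns; a few bucket collisions are expected at 2**8
print(len({tuple(row) for row in X.toarray().tolist()}))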

A binary encoding based on a hash function could be interesting as a new encoder though. It could hash to integers and then apply binary encoding on those. Hashing to "ordinal" encoding first could be factored out, and the current transformer would be that followed by one-hot?
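A rough sketch of that idea (the function name and the md5 choice are illustrative, not an existing API): hash each value to a bucket, then write the bucket out in binary, so n_components columns can distinguish up to 2**n_components categories.

import hashlib
import pandas as pd

def hash_then_binary(series: pd.Series, n_components: int = 8) -> pd.DataFrame:
    """Hash each value to a bucket in [0, 2**n_components) and emit its bits."""
    def encode(value):
        bucket = int(hashlib.md5(str(value).encode("utf-8")).hexdigest(), 16) % (2 ** n_components)
        return [int(bit) for bit in format(bucket, f"0{n_components}b")]
    return pd.DataFrame(series.map(encode).tolist(),
                        columns=[f"col_{i}" for i in range(n_components)],
                        index=series.index)

Unlike BinaryEncoder, nothing needs to be fitted, so unseen categories at transform time still get a deterministic encoding.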

PaulWestenthanner commented 1 year ago

Thanks for pointing this out Ben!
But then our documentation and default values are wrong.

n_components: int
  how many bits to use to represent the feature. By default, we use 8 bits.
  For high-cardinality features, consider using up-to 32 bits.

8 bits to me sounds like 2^8 hash values. Also the default in sklearn is 2^20, whereas ours is 8.
Given that the HashingEncoder does not implement handle-NA / handle-missing behaviour and does not keep the mapping, I'm starting to wonder what the benefit over sklearn is. Sure, we take dataframes as input and have some multiprocessing features, but we'd need to test whether our implementation is faster at all. I'd suspect that sklearn uses better-optimised libraries, sparse output, etc.

PaulWestenthanner commented 1 year ago

I've removed the bug label since this is actually the correct behavior. Keeping the issue open to discuss whether to fix the documentation or drop the encoder altogether (in a backwards-compatible way, pointing to sklearn instead).