scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

question: Is the HashingEncoder meant to be used over multiple variables at a time? #350

Closed solegalli closed 1 year ago

solegalli commented 2 years ago

I am trying to understand the output of the HashingEncoder when encoding more than 1 variable.

I am working with the credit approval dataset from the UCI, which can be found on the UCI website.

My understanding is that if I want to encode a variable with, say, 10 categories into 4 features, each category is assigned an index from 0 to 3 by a hashing function, and the feature at that index receives a 1 during the encoding. In other words, the hashing function returns the index of the feature that gets the 1.
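Roughly, this is how I picture the hashing step (a simplified sketch of the idea, not necessarily the library's exact implementation; md5 is just an illustrative choice):

    import hashlib

    def hash_to_index(category, n_components=4):
        # Map a category string to one of n_components column indices.
        digest = hashlib.md5(str(category).encode("utf-8")).hexdigest()
        return int(digest, 16) % n_components

    # Every row with the same category lands in the same column,
    # e.g. for two hypothetical category strings:
    hash_to_index("v")   # always the same index between 0 and 3
    hash_to_index("ff")  # possibly a different index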

And I can see that in action if I run the following code:

    from category_encoders.hashing import HashingEncoder

    # X_train is the credit approval data loaded into a pandas DataFrame
    encoder = HashingEncoder(cols=["A7"], n_components=4)
    encoder.fit(X_train)
    X_train_enc = encoder.transform(X_train)

The encoded dataset contains the four "hashing features" at the beginning of the dataframe, and the number 1 indicates that the category was allocated to that particular feature:

     col_0  col_1  col_2  col_3 A1     A2     A3 A4 A5  A6      A8 A9 A10  \
596      0      0      1      0  a  46.08  3.000  u  g   c   2.375  t   t   
303      0      0      1      0  a  15.92  2.875  u  g   q   0.085  f   f   
204      0      0      1      0  b  36.33  2.125  y  p   w   0.085  t   t   
351      0      1      0      0  b  22.17  0.585  y  p  ff   0.000  f   f   
118      0      0      1      0  b  57.83  7.040  u  g   m  14.000  t   t   

     A11 A12 A13    A14   A15  
596    8   t   g  396.0  4159  
303    0   f   g  120.0     0  
204    1   f   g   50.0  1187  
351    0   f   g  100.0     0  
118    6   t   g  360.0  1332  

And if I explore the unique values of those features, I can see that they only take values 0 or 1:

    for c in ["col_0", "col_1",  "col_2", "col_3"]:
        print(X_train_enc[c].unique())

    [0 1]
    [0 1]
    [1 0]
    [0 1]

Now, if I instead encode multiple categorical variables using the HashingEncoder, I obtain something that I am not sure I understand:

    from category_encoders.hashing import HashingEncoder

    encoder = HashingEncoder(cols=["A5", "A7", "A12", "A14"], n_components=4)
    encoder.fit(X_train)
    X_train_enc = encoder.transform(X_train)

The encoded dataset contains the four "hashed features" at the beginning of the dataframe, but now they take values beyond 0 and 1:

     col_0  col_1  col_2  col_3 A1     A2     A3 A4  A6      A8 A9 A10  A11  \
596      0      2      2      0  a  46.08  3.000  u   c   2.375  t   t    8   
303      0      1      2      1  a  15.92  2.875  u   q   0.085  f   f    0   
204      1      0      2      1  b  36.33  2.125  y   w   0.085  t   t    1   
351      0      1      2      1  b  22.17  0.585  y  ff   0.000  f   f    0   
118      1      1      2      0  b  57.83  7.040  u   m  14.000  t   t    6   

    A13   A15  
596   g  4159  
303   g     0  
204   g  1187  
351   g     0  
118   g  1332 

I can corroborate this by exploring the unique values of those features:

    for c in ["col_0", "col_1",  "col_2", "col_3"]:
        print(X_train_enc[c].unique())

    [0 1 2]
    [2 1 0 3]
    [2 0 1 3 4]
    [0 1 2]

Do I understand correctly that a value of 2 means that categories from 2 different variables were allocated to that particular feature?

Is this the expected behaviour?

Somehow, I expected that I would get 4 hashed features per variable, and not 4 hashed features in total. Should this not be the case?

PaulWestenthanner commented 2 years ago

Hi @solegalli

I'm not super familiar with the HashingEncoder but I've just read up on the referenced literature (given in the docs) and hope I can answer your question.

If you're trying to encode a feature with 10 categories into 4 bits (n_components=4), the hashing encoder will pretty much do ordinal encoding, since 4 bits can store 16 different values.

As for the question of encoding multiple variables, the medium blog post (https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087) actually mentions both options: defining one hash space globally for all variables to be encoded, and defining one hash space per variable to be encoded. As far as I can tell, category_encoders implements only the global approach: https://github.com/scikit-learn-contrib/category_encoders/blob/305924d0712a07ac6f830f7983223341fc137013/category_encoders/hashing.py#L391

In fact, the original paper (https://alex.smola.org/papers/2009/Weinbergeretal09.pdf) also mentions only the global approach. This is because the examples they give are usually from the NLP domain, working on tokenized data sets. Their usual assumption is that the data set has a column for each word in the (Oxford English) dictionary (bag-of-words tokenization). In those cases it makes a lot of sense to hash all features together. Note that in their case the features are not even categorical.

I agree with you that the output should only contain 0s and 1s. This is indeed strange and needs further investigation. Probably this is a bug. Thank you for pointing this out.

So in summary I'd say:

bmreiniger commented 2 years ago

If you're trying to encode a feature with 10 categories into 4 bits (n_components=4), the hashing encoder will pretty much do ordinal encoding, since 4 bits can store 16 different values.

I don't think this is right; each row should have exactly one nonzero entry?

As for the question of encoding multiple variables, [...]. As far as I can tell category_encoders implements only the global approach[...]

I agree. One could implement the per-feature version using a ColumnTransformer with many single-feature transformers, although it's not particularly elegant.
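Something along these lines (an untested sketch, reusing the column names from the example above):

    from sklearn.compose import ColumnTransformer
    from category_encoders.hashing import HashingEncoder

    cat_cols = ["A5", "A7", "A12", "A14"]

    # One HashingEncoder per column, so each column gets its own hash
    # space of n_components features (4 columns * 4 components = 16 outputs).
    per_feature_hashing = ColumnTransformer(
        transformers=[
            (f"hash_{col}", HashingEncoder(cols=[col], n_components=4), [col])
            for col in cat_cols
        ],
        remainder="passthrough",
    )

    X_train_enc = per_feature_hashing.fit_transform(X_train)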

In fact the original paper [...]

That paper also has some quirks related to separate tasks/users each having their own hash plus a global one, all mapping into some common space.

I agree with you that the output should only contain 0s and 1s. This is indeed strange and needs further investigation. Probably this is a bug. Thank you for pointing this out.

I agree with Soledad that it probably occurs when multiple features hash into the same place: the hits are additive. It's the +=1 instead of =1 here: https://github.com/scikit-learn-contrib/category_encoders/blob/305924d0712a07ac6f830f7983223341fc137013/category_encoders/hashing.py#L383

This should lead to the property that, in the situation of multiple input features each with a single category to hash, every transformed row will have a sum equal to the number of input columns. I could see maybe having an additional parameter to binarize the output instead, if that is an interesting use case (but I don't know much about it either); I wouldn't suggest changing the behavior outright.
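For instance, with the four columns encoded in the example above, a check along these lines should pass (untested):

    hash_cols = ["col_0", "col_1", "col_2", "col_3"]
    # Four input columns hashed into one shared space: the counts in
    # each row should add up to 4, one hit per input column.
    assert (X_train_enc[hash_cols].sum(axis=1) == 4).all()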

PaulWestenthanner commented 2 years ago

Thanks for that additional clarification @bmreiniger

PaulWestenthanner commented 1 year ago

Looking at this again with issue #402 in mind, I'd suggest changing this https://github.com/scikit-learn-contrib/category_encoders/blob/f6349a140c8477b612a63c7d8f5cfe21139f5989/category_encoders/hashing.py#L301

to val = ''.join(x.values), and then hashing the combined string from all columns.

And this needs to be changed to output a list with all bits filled: https://github.com/scikit-learn-contrib/category_encoders/blob/f6349a140c8477b612a63c7d8f5cfe21139f5989/category_encoders/hashing.py#L308

bmreiniger commented 1 year ago

[join then] hash the combined string from all columns.

This would lose all information about rows having the same category in the first column, if they had different values in other columns.
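A quick illustration (md5 just for the example, not the encoder's exact code):

    import hashlib

    def bucket(s, n_components=4):
        # Illustrative hash-to-bucket mapping for a single string.
        return int(hashlib.md5(s.encode()).hexdigest(), 16) % n_components

    # Two rows with the same A5 category but different A7 categories:
    bucket("g" + "v")  # row 1
    bucket("g" + "h")  # row 2
    # The combined strings differ, so the two buckets are generally
    # unrelated and the shared A5 == "g" is no longer visible.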

PaulWestenthanner commented 1 year ago

Never mind my suggestion, you're right of course. So basically our documentation is just wrong, since it suggests that n_components is the number of bits. I'll close this issue again.