scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Target encoding categories with a single training example #413

Closed alrichardbollans closed 1 year ago

alrichardbollans commented 1 year ago

Expected Behavior

When the target encoder is fitted on a binary target and then transforms a category that appeared exactly once in the training data, I would expect the encoded value to differ from the value given to a category that never appeared in training.

Actual Behavior

When the target encoder is fitted on a binary target and then transforms a category that appeared exactly once in the training data, the output is the same as for a category with no training instances. It does not seem to take the target value of that single training instance into account.
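
To make the expectation concrete, here is a minimal pure-Python sketch of sigmoid-smoothed target encoding. It is an illustration of the general technique, not the library's implementation: the function name `smoothed_target_encoding`, the exact blending formula, the extra category 'Alstonia' and all target values are assumptions made up for this example (the parameter names `min_samples_leaf` and `smoothing` are only chosen to mirror `TargetEncoder`'s).

```python
import math
from collections import defaultdict


def smoothed_target_encoding(categories, targets, min_samples_leaf=1, smoothing=1.0):
    """Hypothetical helper: return (mapping, prior) for a binary/numeric target.

    mapping holds one encoded value per category seen in training;
    prior is the global target mean, used for unseen categories.
    """
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    mapping = {}
    for cat, n in counts.items():
        category_mean = sums[cat] / n
        # Sigmoid weight: moves towards 1 as the category count grows,
        # so rare categories are pulled towards the prior.
        weight = 1.0 / (1.0 + math.exp(-(n - min_samples_leaf) / smoothing))
        mapping[cat] = weight * category_mean + (1.0 - weight) * prior
    return mapping, prior


# Illustrative data (assumed): two single-example categories with opposite targets.
names = ["Acokanthera", "Adenium", "Alstonia", "Alstonia", "Alstonia", "Alstonia"]
target = [1, 0, 1, 0, 0, 1]

mapping, prior = smoothed_target_encoding(names, target)


def encode(name):
    # Unseen categories fall back to the prior (global mean).
    return mapping.get(name, prior)


print(encode("Acokanthera"))    # 0.75
print(encode("Adenium"))        # 0.25
print(encode("Prismatomeris"))  # 0.5 (never seen in training)
```

Under this scheme a category with one training example still gets half of its weight from its own target value, so 'Acokanthera' and 'Adenium' are encoded differently from each other and from an unseen category. That is the behaviour I was expecting.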

Steps to Reproduce the Problem

Using the file data.csv, the example below shows that the encoded value is the same for, e.g., 'Acokanthera' and 'Adenium', even though the target values of their single training instances differ. The same encoded value is also returned for categories with no training instances, e.g. 'Prismatomeris'.

    import pandas as pd
    import category_encoders as ce

    data = pd.read_csv('data.csv')

    # Hold out one category entirely so it is unseen during fitting.
    train = data[data['Name'] != 'Prismatomeris']
    X_train = train['Name']
    y_train = train['Target']

    # Fit the target encoder on the training names and binary target.
    target_encoder = ce.TargetEncoder(cols=['Name'])
    target_encoder.fit(X_train, y_train)
    encoded_X_train = target_encoder.transform(X_train)
    encoded_X_train.to_csv('train_encoded.csv')

    # Encode the held-out category, which has no training instances.
    test = data[data['Name'] == 'Prismatomeris']
    X_test = test['Name']
    encoded_X_test = target_encoder.transform(X_test)
    encoded_X_test.to_csv('test_encoded.csv')


alrichardbollans commented 1 year ago

After updating, this issue appears to be resolved in category_encoders version 2.6.0.