scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

CountEncoder incorrectly counts Timestamp columns #412

Closed willsthompson closed 11 months ago

willsthompson commented 1 year ago

Expected Behavior

In releases 2.6.0 and earlier, CountEncoder could encode a column of Timestamps, just like any other non-string column.

Actual Behavior

CountEncoder always returns 0.0 for Timestamp column frequencies.

Steps to Reproduce the Problem

import pandas as pd
from category_encoders import CountEncoder
from pandas import Timestamp

df = pd.DataFrame(
    {
        "TIMESTAMPS": {
            0: Timestamp("1997-09-03 00:00:00"),
            1: Timestamp("1997-09-03 00:00:00"),
            2: Timestamp("2000-09-03 00:00:00"),
            3: Timestamp("1997-09-03 00:00:00"),
            4: Timestamp("1999-09-04 00:00:00"),
            5: Timestamp("2001-09-03 00:00:00"),
        },
        "FLOATS": {
            0: 0.2856523592132305,
            1: 0.2856523592132305,
            2: 0.9002173288230475,
            3: 0.2856523592132305,
            4: 0.928510879560613,
            5: 0.5663259449524071,
        },
    }
)
cat_encoder = CountEncoder(
    cols=["TIMESTAMPS", "FLOATS"],
    normalize=True,
)
df_cat = cat_encoder.fit_transform(df)
df_expected = pd.DataFrame(
    {
        "TIMESTAMPS": {
            0: 0.5,
            1: 0.5,
            2: 0.16666666666666666,
            3: 0.5,
            4: 0.16666666666666666,
            5: 0.16666666666666666,
        },
        "FLOATS": {
            0: 0.5,
            1: 0.5,
            2: 0.16666666666666666,
            3: 0.5,
            4: 0.16666666666666666,
            5: 0.16666666666666666,
        },
    }
)
assert df_cat.equals(df_expected)

The two columns have identical frequencies. Release 2.6.0 encodes both correctly, but in 2.6.1 the assertion fails because the TIMESTAMPS column is encoded as all 0.0 values.
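As a sanity check that does not depend on category_encoders, the expected normalized frequencies can be reproduced with plain pandas. This is only a sketch of what the encoding should yield for the TIMESTAMPS column, not the library's implementation. The variable names are made up for illustration.

```python
import pandas as pd

# Same six timestamps as in the reproduction above.
timestamps = pd.Series(
    pd.to_datetime(
        [
            "1997-09-03",
            "1997-09-03",
            "2000-09-03",
            "1997-09-03",
            "1999-09-04",
            "2001-09-03",
        ]
    )
)

# Relative frequency of each distinct timestamp.
frequencies = timestamps.value_counts(normalize=True)

# Map every row to the frequency of its value (hypothetical reference encoding).
reference = timestamps.map(frequencies)
print(reference.tolist())
```

The first, second, and fourth rows share a value that occurs in three of six rows, so they map to 0.5. The remaining rows map to 1/6 each.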

Specifications

bmreiniger commented 1 year ago

It looks like this is due to the underlying ordinal encoder converting the timestamps (and so is likely to affect other encoders as well):

>>> cat_encoder.ordinal_encoder.mapping
[{'col': 'TIMESTAMPS', 'mapping': 8.732448e+17    1
9.679392e+17    2
9.364032e+17    3
9.994752e+17    4
NaN            -2
dtype: int64, 'data_type': dtype('<M8[ns]')}, {'col': 'FLOATS', 'mapping': 0.285652    1
0.900217    2
0.928511    3
0.566326    4
NaN        -2
dtype: int64, 'data_type': dtype('float64')}]

and then

>>> cat_encoder.mapping['TIMESTAMPS']   
NaN    1.0
Name: TIMESTAMPS, dtype: float64

In this diff: https://github.com/scikit-learn-contrib/category_encoders/commit/d19f69ef26288ce33e5cb34d4796bb372e1b20bd
it seems that categories.tolist() converts numpy timestamps to a different type, whereas list(categories) doesn't (?).
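The difference can be seen with NumPy alone. The sketch below assumes the categories are held in a datetime64[ns] array, which is a guess about what the encoder sees internally; the array and variable names are illustrative.

```python
import numpy as np

# Assumption: the unique categories arrive as a datetime64[ns] NumPy array.
categories = np.array(["1997-09-03", "2000-09-03"], dtype="datetime64[ns]")

# ndarray.tolist() turns nanosecond-resolution datetimes into plain integers
# (nanoseconds since the Unix epoch).
as_tolist = categories.tolist()

# list() keeps each element as a numpy.datetime64 scalar.
as_list = list(categories)

print(type(as_tolist[0]), as_tolist[0])
print(type(as_list[0]), as_list[0])
```

The integer 873244800000000000 matches the 8.732448e+17 key in the ordinal encoder mapping above, which is consistent with the timestamps having been converted to numbers before the count lookup.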

vkhodygo commented 1 year ago

Do you think this is still an issue?

PaulWestenthanner commented 11 months ago

@vkhodygo It was, but I just fixed it myself.
@bmreiniger thanks for the suggestion. I've implemented this and written a test based on @willsthompson's data.