willsthompson closed this 11 months ago
It looks like this is caused by the underlying ordinal encoder converting the timestamps to numeric values (so it is likely to affect other encoders as well):
>>> cat_encoder.ordinal_encoder.mapping
[{'col': 'TIMESTAMPS', 'mapping': 8.732448e+17 1
9.679392e+17 2
9.364032e+17 3
9.994752e+17 4
NaN -2
dtype: int64, 'data_type': dtype('<M8[ns]')}, {'col': 'FLOATS', 'mapping': 0.285652 1
0.900217 2
0.928511 3
0.566326 4
NaN -2
dtype: int64, 'data_type': dtype('float64')}]
and then
>>> cat_encoder.mapping['TIMESTAMPS']
NaN 1.0
Name: TIMESTAMPS, dtype: float64
In this diff: https://github.com/scikit-learn-contrib/category_encoders/commit/d19f69ef26288ce33e5cb34d4796bb372e1b20bd it seems that categories.tolist() converts numpy timestamps where list(categories) doesn't (?).
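For anyone who wants to check the tolist() vs. list() difference in isolation, here is a minimal sketch using plain NumPy. The array values are made up for illustration and are not the data from the report; only the two conversion calls mirror what the comment above describes.

```python
import numpy as np

# Illustrative values only; not the data from the original report.
categories = np.array(["2001-09-09", "2002-03-04"], dtype="datetime64[ns]")

# With nanosecond precision, tolist() falls back to plain Python integers
# (nanoseconds since the epoch) instead of datetime objects.
as_tolist = categories.tolist()

# Iterating the array keeps each element as a numpy.datetime64 scalar.
as_list = list(categories)

print(type(as_tolist[0]).__name__, as_tolist[0])
print(type(as_list[0]).__name__, as_list[0])
```

If the encoder builds its mapping keys from the first form, the keys are numbers rather than timestamps, which would be consistent with the numeric index in the ordinal mapping shown earlier.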
Do you think this is still relevant?
@vkhodygo It was, but I just fixed it myself.
@bmreiniger thanks for the suggestion. I've implemented this and written a test based on @willsthompson's data.
Expected Behavior
In releases 2.6.0 and earlier, CountEncoder could encode a column of Timestamps, just like any other non-string column.
Actual Behavior
CountEncoder always returns 0.0 for Timestamp column frequencies.
Steps to Reproduce the Problem
The frequencies for these two columns are the same. They are encoded correctly in 2.6.0, but the encoding fails in 2.6.1, where the Timestamp column comes out as all 0.0 values.
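As a rough illustration of what "the same frequencies" means here, the sketch below builds a small frame with one Timestamp column and one float column and computes the per-row counts with pandas alone. The column names follow the mapping output quoted in the thread; the values and the helper `expected_counts` are hypothetical and are not the original reproduction script.

```python
import pandas as pd

# Hypothetical data: both columns share the same frequency pattern (2, 1, 3).
df = pd.DataFrame({
    "TIMESTAMPS": pd.to_datetime(
        ["2001-01-01", "2001-01-01", "2002-06-15",
         "2003-03-03", "2003-03-03", "2003-03-03"]
    ),
    "FLOATS": [0.25, 0.25, 0.5, 0.75, 0.75, 0.75],
})

def expected_counts(frame):
    # For every cell, count how often its value occurs in its own column.
    out = pd.DataFrame(index=frame.index)
    for name in frame.columns:
        out[name] = frame.groupby(name)[name].transform("count").astype(float)
    return out

counts = expected_counts(df)
print(counts)
```

Both columns should come out identical here. Running category_encoders' CountEncoder on the same frame and comparing against this result is one way to see whether the Timestamp column is being encoded as 0.0 on a given release.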
Specifications