koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License
1.27k stars 117 forks source link

preprocessing.RepeatingBasisFunction encodes different labels as the same[BUG] #574

Open yug030 opened 1 year ago

yug030 commented 1 year ago

Hi,

I'm working on a task to encode timestamps (more specifically, weekdays) and I noticed RepeatingBasisFunction does not work correctly with weekday=0 and weekday=6: the outputs for both are the same.

Here's the code snippet to reproduce the bug:

import pandas as pd
from sklego.preprocessing import RepeatingBasisFunction

start_date = '2023-06-25 00:00:00'
end_date = '2023-07-03 23:59:59'
timestamps = pd.date_range(start=start_date, end=end_date, freq='1D')
df = pd.DataFrame({'Timestamps': timestamps})
df['wkday'] = df['Timestamps'].dt.weekday
print(df)
rbf = RepeatingBasisFunction(n_periods=4, column="wkday", input_range=(0,6), remainder="drop")
rbf.fit(df)
print(rbf.transform(df))

The output for me is:

  Timestamps  wkday
0 2023-06-25      6
1 2023-06-26      0
2 2023-06-27      1
3 2023-06-28      2
4 2023-06-29      3
5 2023-06-30      4
6 2023-07-01      5
7 2023-07-02      6
8 2023-07-03      0

array([[1.        , 0.36787944, 0.01831564, 0.36787944],
       [1.        , 0.36787944, 0.01831564, 0.36787944],
       [0.64118039, 0.89483932, 0.16901332, 0.06217652],
       [0.16901332, 0.89483932, 0.64118039, 0.06217652],
       [0.01831564, 0.36787944, 1.        , 0.36787944],
       [0.16901332, 0.06217652, 0.64118039, 0.89483932],
       [0.64118039, 0.06217652, 0.16901332, 0.89483932],
       [1.        , 0.36787944, 0.01831564, 0.36787944],
       [1.        , 0.36787944, 0.01831564, 0.36787944]])

Update: If I add 1 to "wkday" column (so it's 1 to 7 now) and use rbf = RepeatingBasisFunction(n_periods=7, column="wkday", input_range=(0,7), remainder="drop") and call fit and transform, I will get a square matrix with 1.0 on the diagonal which looks correct, but it's different from the documentation.

FBruzzesi commented 12 months ago

Hi, thank you for reporting the issue and sorry for the late reply!

This seems to be unexpected/buggy behavior. My understanding is that some assumptions have been made in the implementation and it may be worth to adjust or clearly state them.

I will take the time to have a closer look at it.

Alex-Cremers commented 4 months ago

It's a bit confusing, but I think that this is expected behavior. Theinput_range argument is to be understood as an interval that loops on itself at the boundaries, so if you provide (0,6) you assume that 6 is the same as 0, leaving only distinct 6 integers in the range. In this case, you want to provide (0,7), which translates to the interval [0, 7) and does contain 7 distinct integers.

If you think in terms of continuous variables rather than integer, it possibly makes more sense.