koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License
1.28k stars 117 forks source link

[BUG] Cannot use GroupedPredictor with DecayEstimator for fitting 1-feature dataframe #573

Closed decodering closed 1 year ago

decodering commented 1 year ago

Overview

Running the below works

DecayEstimator(DummyRegressor(),decay=0.9).fit(df[['m']],df['yt'])

while this doesn't (expected to work)

_ = (
    GroupedPredictor(DecayEstimator(DummyRegressor(), decay=0.9), groups=["m"])
    .fit(df[['m']], df['yt'])
)

While following Vincent's excellent sklearn crash course (See relevant screenshot below), I noticed that I wasn't able to use GroupedPredictor to fit with a single feature dataframe df[['m']] if DecayEstimator was used in conjunction. Is this the expected behaviour? I find it odd that calling DecayEstimator(DummyRegressor(),decay=0.9).fit(df[['m']],df['yt']) works, but failes when wrapped w GroupedPredictor.

As far as I understand, in this example the DummyRegressor just calcs the mean of df['yt'] for the grouped period, so adding the index column here or any other cols don't seem to do anything (which looks to be the case when I tried adding a random col in place of index). I'm not 100% sure on this, so please correct me if I'm wrong!

Full error description and reproducable snippet below.

image

Error description

The error: ValueError: Found array with 0 feature(s) (shape=(1825, 0)) while a minimum of 1 is required by DecayEstimator.

image image

Snippet producing error

Example snippet:

from sklearn.dummy import DummyRegressor
from sklego.meta import GroupedPredictor, DecayEstimator
from sklego.datasets import make_simpleseries
from itertools import combinations

yt = make_simpleseries(seed=1)
dates = pd.date_range("2000-01-01", periods=len(yt))
df = (
    pd.DataFrame({"yt": yt,"date": dates})
    .assign(m=lambda d: d.date.dt.month)
    .reset_index()
)

df['a_useless_col'] = 0

# WORKS: Without wrapping the GroupedPredictor
_ = DecayEstimator(DummyRegressor(), decay=0.9).fit(df[['m']],df['yt']).predict(df[['m']])

# WORKS: Without the DecayEstimator
mod1 = (
    GroupedPredictor(DummyRegressor(), groups=["m"])
    .fit(df[['m']], df['yt'])
)

# WORKS: Adding a useless col when wrapping the GroupedPredictor
mod2 = (
    GroupedPredictor(DecayEstimator(DummyRegressor(), decay=0.9), groups=["m"])
    .fit(df[['m','a_useless_col']], df['yt'])
)

# WORKS: Adding the 'index' as shown in tutorial when wrapping the GroupedPredictor
mod3 = (
    GroupedPredictor(DecayEstimator(DummyRegressor(), decay=0.9), groups=["m"])
    .fit(df[['m','index']], df['yt'])
)

# ERROR: Wrapping the GroupedPredictor with DecayEstimator
mod4 = (
    GroupedPredictor(DecayEstimator(DummyRegressor(), decay=0.9), groups=["m"])
    .fit(df[['m']], df['yt'])
)

Additional columns does not seem to be taking any effect on the model output:

# The output predictions of models 2,3 are the same, regardless of the added column in input 'X'
mod2_preds_m = mod2.predict(df[['m']])
mod2_preds_m_useless = mod2.predict(df[['m','a_useless_col']])
mod3_preds_m = mod3.predict(df[['m']])
mod3_preds_m_useless = mod3.predict(df[['m','index']])

array_equal_statements = []
for pair in combinations([
    mod2_preds_m,
    mod2_preds_m_useless,
    mod3_preds_m,
    mod3_preds_m_useless
], 2):
    array_equal_statements.append(np.array_equal(pair[0], pair[1]))

print(f"Are all {len(array_equal_statements)} arrays equal?: {all(array_equal_statements)}")
# OUPUTS: Are all 6 arrays equal?: True

Further comments

From what I can see in the error msg, it seems like it is reducing the dimensions so that it becomes an empty dataframe. I feel like this could be a result of dropping the m column somewhere for some reason when specifying grouping = ['m'] in the GroupedPredictor params.

koaning commented 1 year ago

It's been a while since I've done anything with this library (over 2-3 years) but from the looks of this:

mod4 = (
    GroupedPredictor(DecayEstimator(DummyRegressor(), decay=0.9), groups=["m"])
    .fit(df[['m']], df['yt'])
)

... I kind of get what's going awry. You're passing df[['m']] and the GroupedEstimator is making sure that m is not passed to the other nested models. So you'd effectively be passing an empty dataframe to the DecayEstimator, so it makes sense that it's complaining. The DummyRegressor ignores all the input, while the DecayEstimator does require some input to be around. This seems similar to what I'm doing in the screenshot, or am I skipping over something?

The documentation also shows the exact same examples.

koaning commented 1 year ago

I guess theoretically at least, we could choose to change the DecayEstimator to not run the check_X_y step anymore, or to make it optional. Internally, it doesn't need to use any of the X-variables.

I think it might also make sense to be a bit more explicit in the docs to explain why index is passed along.