koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License

[BUG] regression in grouped estimator #692

Closed koaning closed 3 months ago

koaning commented 4 months ago

I spotted that an image in our docs no longer gives the right demo.

[Screenshot: CleanShot 2024-07-24 at 12 12 43]

It is the docs example for the grouped estimator that demos the decay estimator. Before, the decayed line would be different from the normally grouped one.

It seems that when we copy the decay estimators internally, we somehow drop self.decay_kwargs.

MarcoGorelli commented 4 months ago

thanks @koaning for spotting this, will take a look (cc @FBruzzesi )

koaning commented 4 months ago

It might not be related to narwhals now that I rethink it, but it seems plausible to assume this broke as part of the recent effort. Am also having a look now.

koaning commented 4 months ago

Actually ... this may be a scikit-learn thing. After cloning, it seems that we lose the decay_kwargs.

from sklearn.base import clone
from sklearn.dummy import DummyRegressor
from sklego.meta import DecayEstimator

est = DecayEstimator(DummyRegressor(), decay_func="exponential", decay_rate=0.2)
c = clone(est)
# lhs has the kwargs, rhs does not
est.decay_kwargs, c.decay_kwargs
koaning commented 4 months ago

Ah yeah, this seems like a sklearn thing. It seems that get_params does not return the **kwargs dictionary, which is what the cloning machinery inside of sklearn relies on. I wonder if there is a reason why they omit dictionaries. Might be conscious ... @MarcoGorelli for now though, feel free to relieve yourself of any worry. This is not related to narwhals. Will rename the issue.
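To illustrate what is happening here: scikit-learn's clone rebuilds an estimator from get_params(), and get_params only introspects the named parameters of __init__, silently skipping anything collected via **kwargs. A minimal sketch with a toy estimator (KwargsEstimator is hypothetical, not part of scikit-lego, but it mirrors how DecayEstimator stored decay_kwargs):

```python
from sklearn.base import BaseEstimator, clone


class KwargsEstimator(BaseEstimator):
    """Toy estimator that collects **kwargs, like DecayEstimator did."""

    def __init__(self, alpha=1.0, **extra_kwargs):
        self.alpha = alpha
        self.extra_kwargs = extra_kwargs


est = KwargsEstimator(alpha=0.5, decay_rate=0.2)

# get_params only sees the explicitly named __init__ parameter:
print(est.get_params())          # {'alpha': 0.5}

# clone reconstructs from get_params, so the collected kwargs vanish:
print(clone(est).extra_kwargs)   # {}
```

This matches the behaviour in the repro above: the original estimator has the kwargs, the clone does not.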

FBruzzesi commented 3 months ago

Sorry it took me a bit to look into this.

I see a few possibilities for this at the moment:

  1. Conforming to the scikit-learn API would require rewriting the constructor as:

    - def __init__(self, model, decay_func="exponential", check_input=False, **decay_kwargs):
    + def __init__(self, model, decay_func="exponential", check_input=False, decay_kwargs=None):
          self.model = model
          self.decay_func = decay_func
          self.check_input = check_input
          self.decay_kwargs = decay_kwargs

     and then pass a dictionary instead of kwargs. This is a breaking change in our API.

  2. scikit-learn's clone first tries to return estimator.__sklearn_clone__(), so we can implement such a method ourselves - however, this hook is only available since scikit-learn 1.3
  3. use deepcopy from the copy module instead of scikit-learn's clone in a few meta estimators, but I am not fully aware of the consequences.

Option 1 is the safest in terms of implementation. For option 2, requiring scikit-learn 1.3 may be too recent, and option 3 requires more investigation.

Thoughts?

koaning commented 3 months ago

> Sorry it took me a bit to look into this.

No worries! As always, there is no rush.

> I see a few possibilities for this at the moment:

Ahh nice. I was originally thinking about option 2 myself, but after considering it I think that option 1 is preferable. This is also how class_weight is attached in scikit-learn, and it feels like the best practice to follow. Not to mention the simplest implementation.
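For completeness, the dict-parameter pattern of option 1 plays nicely with clone because the estimator stores the parameter unchanged and get_params can see it. A minimal sketch with a hypothetical toy estimator (not the actual scikit-lego implementation):

```python
from sklearn.base import BaseEstimator, clone


class DictParamEstimator(BaseEstimator):
    """Toy sketch of option 1: an explicit dict parameter, no **kwargs."""

    def __init__(self, alpha=1.0, decay_kwargs=None):
        self.alpha = alpha
        self.decay_kwargs = decay_kwargs


est = DictParamEstimator(decay_kwargs={"decay_rate": 0.2})

# The dict is a named __init__ parameter, so get_params sees it
# and clone preserves it:
print(clone(est).decay_kwargs)  # {'decay_rate': 0.2}
```

Callers then write decay_kwargs={"decay_rate": 0.2} instead of decay_rate=0.2, which is the breaking API change mentioned above.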