koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License

[FEATURE] - Grid search across model parameters AND thresholds with Thresholder() without refitting #551

Open mcallaghan opened 1 year ago

mcallaghan commented 1 year ago

Thanks for this great set of extensions to sklearn.

The Thresholder() model is quite close to something I've been looking for for a while.

I'm looking to include threshold optimisation as part of a broader parameter search.

I can perhaps best describe the desired behaviour as follows:

for each parameters in grid:
    fit model with parameters
    for each threshold in thresholds:
        evaluate model

However, if I pass a model that has not yet been fit to Thresholder(), then, even with refit=False, the same model is fit again for each threshold.
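One likely contributor to the repeated fitting is that GridSearchCV clones its estimator before every fit, and sklearn's clone() returns an unfitted copy, including any estimator nested inside a wrapper. The sketch below illustrates this with plain scikit-learn only. The Pipeline is just a stand-in for a generic wrapper holding a fitted model (it is not Thresholder), and the synthetic data and the `is_fitted` helper are assumptions made for illustration.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_is_fitted

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=200, random_state=0)

# A fitted model placed inside a wrapper (stand-in for any meta-estimator).
fitted = LogisticRegression().fit(X, y)
wrapper = Pipeline([("clf", fitted)])


def is_fitted(est):
    """Hypothetical helper: True if check_is_fitted does not raise."""
    try:
        check_is_fitted(est)
        return True
    except NotFittedError:
        return False


# clone() copies the parameters but drops all learned state,
# so the nested model inside the cloned wrapper is unfitted again.
cloned_inner = clone(wrapper).named_steps["clf"]

original_is_fitted = is_fitted(fitted)
clone_is_fitted = is_fitted(cloned_inner)
print(original_is_fitted, clone_is_fitted)
```

If the same holds for a wrapped model inside Thresholder(), a pre-fitted model would lose its fitted state whenever the wrapper is cloned by a search, regardless of refit=False.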

Is there an easy way around this? Thinking about it, the best way to achieve this would probably be to tinker with the GridSearchCV code, but perhaps you have an idea and would also find this interesting?

Thanks!

MBrouns commented 1 year ago

I haven't tested this, so maybe I'm completely off the mark, but I think you can do this by nesting GridSearchCV objects:

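from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklego.meta import Thresholder

model = make_pipeline(
    ...,
    LogisticRegression()
)

# First search: tune the model parameters.
param_gridsearch = GridSearchCV(
    model,
    param_grid=...
)

param_gridsearch.fit(X, y)

# Second search: tune only the threshold on top of the fitted search.
# Thresholder needs an initial threshold; the grid below overrides it.
threshold_gridsearch = GridSearchCV(
    Thresholder(param_gridsearch, threshold=0.5, refit=False),
    param_grid={'threshold': [0.1, 0.2, ...]}
)

FBruzzesi commented 11 months ago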

@MBrouns before closing the issue, could it be worth adding an example in the docs?

FBruzzesi commented 11 months ago

Having a closer look at this: actually the two approaches are a bit different. The implementation of

for each parameters in grid:
    fit model with parameters
    for each threshold in thresholds:
        evaluate model

would still require running Thresholder for each fitted model, whereas the suggestion above runs it only on the best model.

Maybe a nested GridSearchCV does the trick? (I have never tried that.)

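import numpy as np
from sklearn.model_selection import GridSearchCV
from sklego.meta import Thresholder

# Outer search: varies only the threshold.
mod = GridSearchCV(
    estimator=Thresholder(
        # Inner search: varies the model parameters.
        GridSearchCV(
            estimator=SomeModel(),
            param_grid={...},
            ...
        ),
        threshold=0.1,
        refit=False
    ),
    param_grid={
        "threshold": np.linspace(0.1, 0.9, 10),
    },
    ...
)

_ = mod.fit(X, y)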
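
For comparison, the loop from the original request (fit once per parameter set, then evaluate every threshold on that fit) can also be written by hand. The following is only a minimal sketch using plain scikit-learn, not scikit-lego's Thresholder and not a tested recommendation. The synthetic data, the grid over `C`, the F1 metric, and the names `scores` and `n_fits` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, StratifiedKFold

# Synthetic, imbalanced data (illustrative only).
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

param_grid = ParameterGrid({"C": [0.1, 1.0, 10.0]})  # assumed model grid
thresholds = np.linspace(0.1, 0.9, 9)                 # assumed threshold grid
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = {}  # (C, threshold) -> list of per-fold F1 scores
n_fits = 0   # counts how many times a model is actually fitted

for params in param_grid:
    for train_idx, test_idx in cv.split(X, y):
        # Fit once per (parameter set, fold) ...
        model = LogisticRegression(max_iter=1000, **params)
        model.fit(X[train_idx], y[train_idx])
        n_fits += 1

        # ... then reuse the predicted probabilities for every threshold.
        proba = model.predict_proba(X[test_idx])[:, 1]
        for t in thresholds:
            pred = (proba >= t).astype(int)
            key = (params["C"], round(float(t), 2))
            score = f1_score(y[test_idx], pred, zero_division=0)
            scores.setdefault(key, []).append(score)

# Average over folds and pick the best (C, threshold) combination.
mean_scores = {key: float(np.mean(vals)) for key, vals in scores.items()}
best_C, best_threshold = max(mean_scores, key=mean_scores.get)
print(n_fits, len(mean_scores), best_C, best_threshold)
```

Here the number of fits depends only on the model grid and the number of folds (3 x 5 in this sketch), while all 27 parameter/threshold combinations are still scored.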