scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rational for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License
47 stars 34 forks source link

SLEP016: parameter spaces on estimators #62

Open jnothman opened 2 years ago

jnothman commented 2 years ago

Ready for review!

jnothman commented 2 years ago

Thanks for the prompt review @adrinjalali. At this point I'm not sure what to update based on your comments, but I am curious to understand better what you find attractive about the Searchgrid API relative to this.

jnothman commented 2 years ago

Note to self: an alternative here might be a way to alias deep parameters at the metaestimator level...

jnothman commented 2 years ago

Thanks for the review @amueller. I hope I find clarity of mind and time soon to make edits.

Over all, do you feel like we should have separate set_grid and set_distrib, or combined set_param_space whose rvs will be rejected by a grid search?

One thought that comes to mind is how to make this something that usefully integrates with external hyperparameter optimisation tools. I mean it may be their responsibility to do so, but I wonder how we assist in their support.

aidiss commented 1 year ago

I would like to propose an alternative approach. Core idea is to use mapping between an object and space, and then iterate through given pipeline and build the parameter_grid.

    pipe = make_pipeline(DecisionTree())
    object_param_grid = {DecisionTree: {"max_depth": [1, 2]}
    create_param_grid(pipe, object_param_grid) == {'decisiontree__max_depth': [1, 2]}

This is useful when the pipeline is not complex and there are no instances that use different params.

It is possible to change this a bit and use instance_to_space mapping.

    tree = DecisionTree()
    pipe = make_pipeline()
    object_param_grid = {tree: {"max_depth": [1, 2]}
    create_param_grid(pipe, object_param_grid) == {'decisiontree__max_depth': [1, 2]}

This would add more granularity, but would requite instantiating transformers/estimators before generating the grid.

Here is a source that achieves object_to_paramgrid mapping.

def _get_steps(object_, ):
    """Retrieves steps/transformers from Pipeline, FeatureUnion or ColumnTransformer objects"""
    object_to_step_name = {
        Pipeline: "steps", 
        FeatureUnion: "transformer_list",
        ColumnTransformer: "transformers",
    }

    if step_name := object_to_step_name.get(object_.__class__):
        steps = [steps for steps in getattr(object_, step_name)]
        return steps

def resolve(parent_object, object_param_grid, path="", param_grid=None):
    if param_grid is None:
        param_grid = {}

    steps = _get_steps(parent_object)
    if not steps:
        return param_grid

    for child_object_path, child_object, *_ in steps:
        full_path = '__'.join([path, child_object_path]).strip("__")
        child_object_class_name = child_object.__class__

        if object_params := object_param_grid.get(child_object_class_name):
            flattened_param_grid = {f"{full_path}__{k}": v for k, v in object_params.items()}
            param_grid = flattened_param_grid | param_grid

        param_grid = resolve(child_object, object_param_grid, path=path+child_object_path, param_grid=param_grid)
    return param_grid
jnothman commented 8 months ago

I like several aspects of that proposal, @aidiss, though I'm not sure I'd worry about supporting classes in the first version. Specifically:

One thing I'm not sure about is how to manage errors. If, for instance, the user were to clone the estimator, we'd suddenly have no grid for it if it were defined in terms of instances, and that would not be obvious to the user. On the other hand, I think this is a flaw in all or several the proposed designs.

I'll otherwise need to think about whether there are aspects of searchgrid capability that could not be reproduced with this approach.

jnothman commented 8 months ago

I've recalled @aidiss that your suggestion is pretty much the same thing that I proposed in https://github.com/scikit-learn/scikit-learn/pull/21784, although I admit that the .set interface is not particularly pleasant.

jnothman commented 8 months ago

I think I prefer the Searchgrid design, but I'm taking your word on people preferring this design to that one.

I wonder if you meant the GridFactory design, @adrinjalali? Searchgrid secretly sets attributes on estimators, so it's not a great functional/OOP design.

jnothman commented 8 months ago

I'd be keen to merge this and then have the team decide if it's the right proposal, or if we should go down one of the SLEP-free paths (e.g. GridFactory, or a lighter API variant of it like above).

We have a lot of the implementation already, and mostly need to come to a decision.

@adrinjalali interested in reviewing for merge?