Open jnothman opened 2 years ago
Thanks for the prompt review @adrinjalali. At this point I'm not sure what to update based on your comments, but I am curious to understand better what you find attractive about the Searchgrid API relative to this.
Note to self: an alternative here might be a way to alias deep parameters at the metaestimator level...
Thanks for the review @amueller. I hope I find clarity of mind and time soon to make edits.
Over all, do you feel like we should have separate set_grid
and set_distrib
, or combined set_param_space
whose rvs will be rejected by a grid search?
One thought that comes to mind is how to make this something that usefully integrates with external hyperparameter optimisation tools. I mean it may be their responsibility to do so, but I wonder how we assist in their support.
I would like to propose an alternative approach. Core idea is to use mapping between an object and space, and then iterate through given pipeline and build the parameter_grid.
pipe = make_pipeline(DecisionTree())
object_param_grid = {DecisionTree: {"max_depth": [1, 2]}
create_param_grid(pipe, object_param_grid) == {'decisiontree__max_depth': [1, 2]}
This is useful when the pipeline is not complex and there are no instances that use different params.
It is possible to change this a bit and use instance_to_space mapping.
tree = DecisionTree()
pipe = make_pipeline()
object_param_grid = {tree: {"max_depth": [1, 2]}
create_param_grid(pipe, object_param_grid) == {'decisiontree__max_depth': [1, 2]}
This would add more granularity, but would requite instantiating transformers/estimators before generating the grid.
Here is a source that achieves object_to_paramgrid mapping.
def _get_steps(object_, ):
"""Retrieves steps/transformers from Pipeline, FeatureUnion or ColumnTransformer objects"""
object_to_step_name = {
Pipeline: "steps",
FeatureUnion: "transformer_list",
ColumnTransformer: "transformers",
}
if step_name := object_to_step_name.get(object_.__class__):
steps = [steps for steps in getattr(object_, step_name)]
return steps
def resolve(parent_object, object_param_grid, path="", param_grid=None):
if param_grid is None:
param_grid = {}
steps = _get_steps(parent_object)
if not steps:
return param_grid
for child_object_path, child_object, *_ in steps:
full_path = '__'.join([path, child_object_path]).strip("__")
child_object_class_name = child_object.__class__
if object_params := object_param_grid.get(child_object_class_name):
flattened_param_grid = {f"{full_path}__{k}": v for k, v in object_params.items()}
param_grid = flattened_param_grid | param_grid
param_grid = resolve(child_object, object_param_grid, path=path+child_object_path, param_grid=param_grid)
return param_grid
I like several aspects of that proposal, @aidiss, though I'm not sure I'd worry about supporting classes in the first version. Specifically:
*SearchCV
implementations, with such a dict being a drop-in replacement for the current param_grid
or param_distributions
. Therefore the change would not require a SLEP.One thing I'm not sure about is how to manage errors. If, for instance, the user were to clone the estimator, we'd suddenly have no grid for it if it were defined in terms of instances, and that would not be obvious to the user. On the other hand, I think this is a flaw in all or several the proposed designs.
I'll otherwise need to think about whether there are aspects of searchgrid capability that could not be reproduced with this approach.
I've recalled @aidiss that your suggestion is pretty much the same thing that I proposed in https://github.com/scikit-learn/scikit-learn/pull/21784, although I admit that the .set
interface is not particularly pleasant.
I think I prefer the Searchgrid design, but I'm taking your word on people preferring this design to that one.
I wonder if you meant the GridFactory design, @adrinjalali? Searchgrid secretly sets attributes on estimators, so it's not a great functional/OOP design.
I'd be keen to merge this and then have the team decide if it's the right proposal, or if we should go down one of the SLEP-free paths (e.g. GridFactory, or a lighter API variant of it like above).
We have a lot of the implementation already, and mostly need to come to a decision.
@adrinjalali interested in reviewing for merge?
Ready for review!