microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/
MIT License
3.78k stars 498 forks

Different search space for flaml.tune vs for built-in models? #717

Open EgorKraevTransferwise opened 1 year ago

EgorKraevTransferwise commented 1 year ago

I am trying to add a new time series model to the list of FLAML's built-in ones, but have trouble specifying the search space. This model contains two component models from FLAML's builtins, and I want to search over the available component models and their respective search spaces, so the code is

    def _search_space(
        self, data: TimeSeriesDataset, task: Task, pred_horizon: int, **params
    ):
        estimators = {
            "model_lo": ["arima", "sarimax"],
            "model_hi": ["arima", "sarimax"],
        }
        out = {}
        for mdl, ests in estimators.items():
            est_cfgs = []
            for est in ests:
                est_class = task.estimator_class_from_str(est)
                est_cfgs.append(
                    {
                        "estimator": est,
                        **(est_class.search_space(data, task, pred_horizon)),
                    }
                )
            out[mdl] = tune.choice(est_cfgs)
        return out

However when I try to use that, I get the following error:

        for name, space in search_space.items():
>           assert (
                "domain" in space
            ), f"{name}'s domain is missing in the search space spec {space}"
E           AssertionError: model_lo's domain is missing in the search space spec <flaml.tune.sample.Categorical object at 0x00000199C5EC4F70>

In auto-causality, the above way of defining the nested search space works fine - am I doing something wrong or is the search space definition spec different for FLAML's built-in models, and if so, why?

sonichi commented 1 year ago

The search space dict format returned by each estimator's search_space function is different from the one used by flaml.tune. See https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#search-space for details. In short, each hyperparameter's value is itself a dict that must contain "domain" (required) and may contain "init_value" and "low_cost_init_value" (both optional) as keys. The value for "domain" is the same as the value in the space dict passed to flaml.tune.
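To make the difference between the two formats concrete, here is a minimal sketch of a top-level conversion from the estimator format to the flaml.tune format. It is an illustration only, not FLAML's actual code: the function name split_estimator_space, the hyperparameter names, and the string placeholders standing in for real domains (such as the objects returned by tune.randint or tune.choice) are all assumptions.

```python
from typing import Any, Dict, Tuple


def split_estimator_space(
    search_space: Dict[str, Dict[str, Any]]
) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]:
    """Split an estimator-style spec into three flat dicts.

    Returns (tune_space, init_config, low_cost_config). Hypothetical helper,
    written only to illustrate the format described above.
    """
    tune_space: Dict[str, Any] = {}
    init_config: Dict[str, Any] = {}
    low_cost_config: Dict[str, Any] = {}
    for name, spec in search_space.items():
        # In the estimator format every hyperparameter maps to a dict
        # whose "domain" key is required.
        if not isinstance(spec, dict) or "domain" not in spec:
            raise ValueError(f"{name}'s domain is missing in the spec {spec!r}")
        tune_space[name] = spec["domain"]  # same value flaml.tune would receive
        if "init_value" in spec:  # optional
            init_config[name] = spec["init_value"]
        if "low_cost_init_value" in spec:  # optional
            low_cost_config[name] = spec["low_cost_init_value"]
    return tune_space, init_config, low_cost_config


# Placeholder strings stand in for real domains (an assumption of this sketch).
estimator_space = {
    "n_estimators": {
        "domain": "<integer domain>",
        "init_value": 4,
        "low_cost_init_value": 4,
    },
    "learning_rate": {"domain": "<float domain>", "init_value": 0.1},
}
tune_space, init_config, low_cost_config = split_estimator_space(estimator_space)
```

Passing a bare domain object instead of a dict with a "domain" key, as in the original _search_space above, is what trips the check in this sketch and the assertion in the reported error.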

EgorKraevTransferwise commented 1 year ago

Thanks! After changing the last line of the for-loop above to out[mdl] = {"domain": tune.choice(est_cfgs)}, I get another error (traceback below).

Can the syntax you use for estimators not deal with nested/hierarchical search spaces in the same way that flaml.tune does?

The debugger screenshot after the traceback shows what the complete search space looks like.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\flaml\automl\automl.py:1814: in fit
    self._search()
..\..\flaml\automl\automl.py:2378: in _search
    self._search_sequential()
..\..\flaml\automl\automl.py:2201: in _search_sequential
    use_ray=False,
..\..\flaml\tune\tune.py:502: in run
    trial_to_run = _runner.step()
..\..\flaml\tune\trial_runner.py:125: in step
    config = self._search_alg.suggest(trial_id)
..\..\flaml\searcher\suggestion.py:213: in suggest
    suggestion = self.searcher.suggest(trial_id)
..\..\flaml\searcher\blendsearch.py:1057: in suggest
    return super().suggest(trial_id)
..\..\flaml\searcher\blendsearch.py:747: in suggest
    init_config, self._ls_bound_min, self._ls_bound_max
..\..\flaml\searcher\flow2.py:242: in complete_config
    partial_config, self.space, self, disturb, lower, upper
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

partial_config = {'monthly_fourier_degree': 2}
space = {'model_hi': <flaml.tune.sample.Categorical object at 0x00000187CF3970C8>, 'model_lo': <flaml.tune.sample.Categorical object at 0x00000187CF391808>, 'monthly_fourier_degree': <flaml.tune.sample.Integer object at 0x00000187CF397148>}
flow2 = <flaml.searcher.flow2.FLOW2 object at 0x00000187CF391FC8>, disturb = 0
lower = {'monthly_fourier_degree': 0.14285714285714285}
upper = {'monthly_fourier_degree': 0.14285714285714285}

    def complete_config(
        partial_config: Dict,
        space: Dict,
        flow2,
        disturb: bool = False,
        lower: Optional[Dict] = None,
        upper: Optional[Dict] = None,
    ) -> Tuple[Dict, Dict]:
        """Complete partial config in space.

        Returns:
            config, space.
        """
        config = partial_config.copy()
        normalized = normalize(config, space, partial_config, {})
        # print("normalized", normalized)
        if disturb:
            for key, value in normalized.items():
                domain = space.get(key)
                if getattr(domain, "ordered", True) is False:
                    # don't change unordered cat choice
                    continue
                if not callable(getattr(domain, "get_sampler", None)):
                    continue
                if upper and lower:
                    up, low = upper[key], lower[key]
                    if isinstance(up, list):
                        gauss_std = (up[-1] - low[-1]) or flow2.STEPSIZE
                        up[-1] += flow2.STEPSIZE
                        low[-1] -= flow2.STEPSIZE
                    else:
                        gauss_std = (up - low) or flow2.STEPSIZE
                        # allowed bound
                        up += flow2.STEPSIZE
                        low -= flow2.STEPSIZE
                elif domain.bounded:
                    up, low, gauss_std = 1, 0, 1.0
                else:
                    up, low, gauss_std = np.Inf, -np.Inf, 1.0
                if domain.bounded:
                    if isinstance(up, list):
                        up[-1] = min(up[-1], 1)
                        low[-1] = max(low[-1], 0)
                    else:
                        up = min(up, 1)
                        low = max(low, 0)
                delta = flow2.rand_vector_gaussian(1, gauss_std)[0]
                if isinstance(value, list):
                    # points + normalized index
                    value[-1] = max(low[-1], min(up[-1], value[-1] + delta))
                else:
                    normalized[key] = max(low, min(up, value + delta))
        config = denormalize(normalized, space, config, normalized, flow2._random)
        # print("denormalized", config)
        for key, value in space.items():
            if key not in config:
                config[key] = value
        for _, generated in generate_variants_compatible(
            {"config": config}, random_state=flow2.rs_random
        ):
            config = generated["config"]
            break
        subspace = {}
        for key, domain in space.items():
            value = config[key]
            if isinstance(value, dict):
                if isinstance(domain, sample.Categorical):
                    # nested space
                    index = indexof(domain, value)
                    # point = partial_config.get(key)
                    # if isinstance(point, list):     # low cost point list
                    #     point = point[index]
                    # else:
                    #     point = {}
                    config[key], subspace[key] = complete_config(
                        value,
                        domain.categories[index],
                        flow2,
                        disturb,
>                       lower and lower[key][index],
                        upper and upper[key][index],
                    )
E                   KeyError: 'model_lo'

..\..\flaml\tune\space.py:543: KeyError

[Screenshot: debugger view of the complete search space]
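The immediate cause of the KeyError can be read off the values printed in the traceback: lower and upper are non-empty dicts that contain only monthly_fourier_degree, so the guard in `lower and lower[key][index]` passes its truthiness check and then indexes a key that is not there. A minimal reproduction in plain Python, independent of FLAML (the helper name nested_bound is hypothetical):

```python
# Values copied from the traceback above.
lower = {"monthly_fourier_degree": 0.14285714285714285}
key, index = "model_lo", 0


def nested_bound(bounds, key, index):
    """Same expression as the failing line; it only guards against None or {}."""
    return bounds and bounds[key][index]


# With no bounds at all, the guard short-circuits and nothing is indexed.
no_bounds_result = nested_bound(None, key, index)

# With bounds that lack the nested key, the lookup raises KeyError.
try:
    nested_bound(lower, key, index)
    raised = False
    missing = None
except KeyError as err:
    raised = True
    missing = err.args[0]
```

So the bounds passed down by the searcher hold no entry for the nested hyperparameter model_lo, while the nested branch of complete_config assumes one exists whenever bounds are given.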

sonichi commented 1 year ago

Right. Here we convert the dict returned by the search_space function to the search space dict format required by flaml.tune: https://github.com/microsoft/FLAML/blob/87d9b35d634f8085ea87b588a85a07bf7a3b7197/flaml/automl.py#L174 However, the conversion is applied only at the top level and does not traverse the hierarchy, so hierarchical search spaces are not handled.

The solution would be to make it recursive: if the domain is a choice among multiple child search spaces, go through each of them recursively.
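A rough sketch of what such a recursive conversion could look like follows. It is not a patch against FLAML: the Choice class is a hypothetical stand-in for flaml.tune.sample.Categorical (the object returned by tune.choice, whose categories attribute appears in the traceback), the function names are invented, and the domains are string placeholders. Handling of "init_value" and "low_cost_init_value" for nested children is deliberately left out.

```python
from typing import Any, Dict, List


class Choice:
    """Hypothetical stand-in for flaml.tune.sample.Categorical."""

    def __init__(self, categories: List[Any]):
        self.categories = list(categories)


def is_spec(value: Any) -> bool:
    """True if value is an estimator-style spec, i.e. a dict with a "domain" key."""
    return isinstance(value, dict) and "domain" in value


def to_tune_space(est_space: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively unwrap {"domain": ...} specs into a flaml.tune-style space."""
    tune_space: Dict[str, Any] = {}
    for name, value in est_space.items():
        # Unwrap a spec; constants such as "estimator": "arima" pass through.
        domain = value["domain"] if is_spec(value) else value
        if isinstance(domain, Choice):
            # A choice among child search spaces: convert each child in turn.
            domain = Choice(
                [
                    to_tune_space(cat) if isinstance(cat, dict) else cat
                    for cat in domain.categories
                ]
            )
        tune_space[name] = domain
    return tune_space


# Illustrative nested space; hyperparameter names and domains are placeholders.
nested = {
    "model_lo": {
        "domain": Choice(
            [
                {"estimator": "arima", "p": {"domain": "<int domain>", "init_value": 2}},
                {"estimator": "sarimax", "P": {"domain": "<int domain>"}},
            ]
        )
    },
    "monthly_fourier_degree": {"domain": "<int domain>", "init_value": 2},
}
converted = to_tune_space(nested)
```

After conversion, each child of model_lo is a plain flaml.tune-style dict, which is the shape the nested branch of complete_config expects for domain.categories[index].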