ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[tune] Support nesting grid_search in lambdas #3466

Open gehring opened 5 years ago

gehring commented 5 years ago

Describe the problem

The documentation for generate_variants(...) states that the two kinds of unresolved parameters, lambda functions and grid_search, can be nested. From ray/tune/suggest/variant_generator.py, line 33:

"""It is also possible to nest the two, e.g. have a lambda function return a grid search or vice versa, as long as there are no cyclic dependencies between unresolved values. """

However, it seems like even the following simple case fails:

import ray
from ray import tune

def dummy_fn(config, reporter):
  print(config)

def resolve_b(spec):
  values = [i**spec.config.a for i in range(2, 4)]
  return tune.grid_search(values)

exp_config = {
  "dummy_exp": {
    "run": dummy_fn,
    "config": {"a": tune.grid_search([1, 2]),
               "b": resolve_b},
  },
}

ray.init()
tune.run_experiments(exp_config)

This fails with the following error message (raised from ray/tune/suggest/variant_generator.py, line 129):

ValueError: The variable `('config', 'b')` could not be unambiguously resolved to a single value. Consider simplifying your variable dependencies.

Additionally, the error handling seems off, as the error message appears incorrect in blaming dependencies. The following example fails for me in the same way as the case above despite having no dependencies at all:

import ray
from ray import tune

def dummy_fn(config, reporter):
  print(config)

exp_config = {
  "dummy_exp": {
    "run": dummy_fn,
    "config": {"a": tune.grid_search([1, 2]),
               "b": lambda spec: tune.grid_search([3, 4])},
  },
}

ray.init()
tune.run_experiments(exp_config)
richardliaw commented 5 years ago

The second case seems to be a little odd, and I think perhaps the best thing to do here is just to tone down the documentation.

You can achieve what you intend to do without nesting the grid search, right?

import ray
from ray import tune

def dummy_fn(config, reporter):
  # Derive the dependent value inside the trainable rather than in the search space.
  b2 = config["b"] ** config["a"]
  print(config)

exp_config = {
  "dummy_exp": {
    "run": dummy_fn,
    "config": {"a": tune.grid_search([1, 2]),
               "b": tune.grid_search([2,3,4])},
  },
}

ray.init()
tune.run_experiments(exp_config)
gehring commented 5 years ago

I meant those examples to be simple, minimal uses of nested grid search. The actual use case that led me to this issue involves wanting to search different parts of the space depending on other parameters. This can occur when some variable a alters the scale or meaning of other variables. Specifically, I was trying to use it to pick different values of the learning rate to search over based on which optimizer is currently being tried, e.g.:

import numpy as np

import ray
from ray import tune

# Mapping from optimizer and batch size to learning rate range to explore.
LR_RANGES = {
    ("sgd", 1): (0.0001, 0.01),
    ("sgd", 100): (0.001, 0.1),
    ("adam", 1): (0.00001, 0.01),
    ("adam", 100): (0.0001, 0.1),
}

def dummy_fn(config, reporter):
  print(config)

# Generate a grid search for the learning rate for a given method and
# batch size.
def resolve_learning_rate(spec):
  method = spec.config.method
  batch_size = spec.config.batch_size
  lr_range = LR_RANGES[(method, batch_size)]
  values = list(np.geomspace(*lr_range))
  return tune.grid_search(values)

exp_config = {
  "dummy_exp": {
    "run": dummy_fn,
    "config": {
      "method": tune.grid_search(["sgd", "adam"]),
      "batch_size": tune.grid_search([1, 100]),
      "learning_rate": resolve_learning_rate,
    },
  },
}

ray.init()
tune.run_experiments(exp_config)

Granted, since in this example the ranges themselves are fixed in advance, we could work around it by splitting the experiment into smaller ones and then doing our own grid search, as sketched below. However, a situation can arise where we want the range of some hyper-parameter (e.g., step size, trust region) to depend on the scale of some randomly sampled hyper-parameter (e.g., RBF width, reward clipping value). I'm not sure that has a simple workaround beyond writing a custom SearchAlgorithm class.
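
For reference, here is a minimal sketch of that splitting workaround, reusing LR_RANGES and dummy_fn from above (the sub-experiment names and the num=5 grid resolution are arbitrary choices for illustration):

import numpy as np

import ray
from ray import tune

# One sub-experiment per (method, batch_size) pair, each with its own fixed
# learning-rate grid, instead of a single experiment with a dependent grid.
exp_config = {}
for (method, batch_size), lr_range in LR_RANGES.items():
  exp_name = "dummy_exp_{}_{}".format(method, batch_size)
  exp_config[exp_name] = {
    "run": dummy_fn,
    "config": {
      "method": method,
      "batch_size": batch_size,
      # num=5 is an arbitrary grid resolution.
      "learning_rate": tune.grid_search(list(np.geomspace(*lr_range, num=5))),
    },
  }

ray.init()
tune.run_experiments(exp_config)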

Additionally, introducing a new parameter whose semantics are dictated by the need for a workaround is somewhat undesirable. Using your proposed workaround to illustrate: while that solution works fine, we've now introduced a parameter whose meaning comes from implementation restrictions rather than from more stable/abstract concepts. This makes it less clear what is happening and more prone to bugs as the experiment is forked/modified. Workarounds like these are also likely to be omitted from any technical write-up.

That being said, I don't mean to overplay the importance of this feature. I think either removing the problematic documentation or fixing the nested grid_search functionality can be easily argued for, and both are acceptable solutions. On one hand, simply removing support for (and mention of) the feature is easy and would help keep tune lean; on the other, fixing the feature would let tune users keep some types of grid search config clean and easy to read.

However, I am still concerned with how the issue manifests itself. Given how inaccurate the error message is, it would probably be wise to make sure there isn't some more fundamental issue at play, since this does not appear to be where the code assumes such an error would occur. I suspect investigating this is likely to lead to an easy fix allowing nested grid search to work properly. Maybe someone familiar with this code could look into it and identify where the broken assumption lies.

tl;dr

IMO, whether removing or fixing the nesting feature is best depends on how difficult the fix would be. I do think there is value in supporting it, but given that it only affects some rare cases, many of which have acceptable workarounds, it is not worth bloating tune and thereby increasing its maintenance cost. Though, even if the feature is removed, the way the error is generated hints that there are some (potentially benign) issues with the assumed behavior of some of the private _resolve* functions that should be looked into.

Edits: typo fixes

hartikainen commented 5 years ago

I've been confused by the same error myself when I tried running a nested grid search. I would also benefit from nested grid_search support; however, it's not a very high priority for me, since so far I've always managed to work around it.

hartikainen commented 5 years ago

I've recently had several cases where I would've benefitted from being able to nest the lambdas in a variant spec. I can still work around this by creating Python conditionals outside of the variant spec definition, but it's slowly starting to make my variant definitions more complicated.

Here's a slightly simplified example from the latest one, for reference:

from pprint import pprint

from ray import tune
from ray.tune.suggest.variant_generator import generate_variants

def metric_learner_kwargs(spec):
    spec = spec.get('config', spec)
    shared_kwargs = {
        'learning_rate': 3e-4,
        'hidden_layer_sizes': (256, 256),
    }

    if spec['metric_learner_params']['type'] == 'OnPolicyMetricLearner':
        kwargs = {
            **shared_kwargs,
            **{
                'train_every_n_steps': 128,
            }
        }
    elif spec['metric_learner_params']['type'] == 'HingeMetricLearner':
        kwargs = {
            **shared_kwargs,
            **{
                'train_every_n_steps': 1,
                'lambda_estimator_kwargs': {},
                'max_distance': tune.sample_from(lambda spec: (
                    10 + spec.get('config')['max_path_length']
                ))
            }
        }

    return kwargs

spec = {
    'max_length': 100,
    'metric_learner_params': {
        'type': tune.grid_search([
            'OnPolicyMetricLearner',
            'HingeMetricLearner'
        ]),
        'kwargs': tune.sample_from(metric_learner_kwargs)
    }
}

pprint(list(generate_variants(spec)))

Edit: Here's an even simpler example covering the same case:

spec = {
    'test1': tune.grid_search(['a', 'b']),
    'test2': tune.grid_search(['c', 'd']),
    'test3': tune.sample_from(lambda spec: (
        {
            'a': tune.sample_from(lambda spec: spec['test2'] * 2),
            'b': tune.sample_from(lambda spec: spec['test2'] * 3),
        }[spec['test1']]
    ))
}
krfricke commented 4 years ago

The following code works in the latest master:

tune.run(
    training_function,
    config={
        'test1': tune.grid_search(['a', 'b']),
        'test2': tune.grid_search(['c', 'd']),
        'test3': tune.sample_from(lambda spec: {
                'a': tune.sample_from(lambda spec: spec.config['test2'] * 2),
                'b': tune.sample_from(lambda spec: spec.config['test2'] * 3),
            }[spec.config['test1']])
    })
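
(For these grids, test3 resolves to 'cc', 'dd', 'ccc', and 'ddd' across the four variants.)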

If there are still cases we don't consider, please open a new issue. Thanks!

hartikainen commented 4 years ago

Nice, love it, thanks @krfricke!

gehring commented 4 years ago

@krfricke This issue was for nesting grid_search inside a lambda function (see above for some use cases). Is this now supported in master?

krfricke commented 4 years ago

Ah, this looks a bit more complicated. I'll look into this again later. Let me re-open this for now until I've checked it.

krfricke commented 3 years ago

The main reason we currently don't support this is that we require the number of samples to be known in advance. However, with dependent grid search spaces, the number of grid search elements can vary. Even if we introduced a method to specify the number of elements and fix it, resolving the grid search variables would require a significant overhaul of our current search space resolution algorithm.
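
To illustrate the counting problem, consider a hypothetical spec of this kind (this is the unsupported pattern, not working code):

from ray import tune

# Hypothetical dependent spec: the size of the inner grid depends on the
# resolved value of "a", so the total number of grid variants (here 2 + 3 = 5)
# cannot be determined from the spec without first resolving "a".
config = {
    "a": tune.grid_search(["small", "large"]),
    "b": tune.sample_from(lambda spec: tune.grid_search(
        [1, 2] if spec.config["a"] == "small" else [1, 2, 3])),
}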

We can keep this issue open to track this request, but it's unlikely we'll get to this soon.

ricardobatistad commented 3 years ago

Also interested in nesting grid_search inside a lambda function:

import numpy as np

config = {
        "optimizer"    : tune.grid_search(["AdamW", "SGDW"]),
        "scheduler"    : tune.grid_search(["none", "cosine", "warm_restarts", "plateau"]),

        "lr"           : tune.uniform(1e-7, 1e-3),
        "weight_decay" : tune.uniform(0, 1e-1),
        "nesterov"     : lambda spec: tune.grid_search([True, False]) if spec.config.optimizer == "SGDW" else None,
        "momentum"     : lambda spec: tune.uniform(0, 1) if spec.config.optimizer == "SGDW" else None,
        "dampening"    : lambda spec: None if spec.config.nesterov is None else spec.config.nesterov*np.random.uniform(),

        "T_0"          : lambda spec: tune.grid_search([5, 10]) if spec.config.scheduler == "warm_restarts" else None,
        "T_mult"       : lambda spec: tune.grid_search([1, 2]) if spec.config.scheduler == "warm_restarts" else None,
        "factor"       : tune.sample_from(lambda spec: (spec.config.scheduler == "plateau") * np.random.uniform())
        }

Alternatively, this sort of nesting:

config = {
        "optimizer" : tune.grid_search([
            {"SGDW": {
                "lr"           : tune.uniform(1e-7, 1e-3),
                "weight_decay" : tune.uniform(0, 1e-1),
                "nesterov"     : tune.grid_search([True, False]),
                "momentum"     : tune.uniform(0, 1),
                "dampening"    : tune.sample_from(lambda spec: spec.config.optimizer.SGDW.nesterov * np.random.uniform())}},
             {"AdamW" : {
                 "lr"            : tune.uniform(1e-7, 1e-3),
                 "weight_decay" : tune.uniform(0, 1e-1)}}]),
        "scheduler" : tune.grid_search([
            {"none"  : None},
            {"cosine" : {
                "T_max" : set_mod.NUM_EPOCHS}},
            {"warm_restarts" : {
                "T_0"    : tune.grid_search([5, 10]),
                "T_mult" : tune.grid_search([1, 2])}},
            {"plateau" : {
                "factor"   : tune.uniform(0, 1),
                "patience" : 10}}])
        }

Edit

Gave up on conditioning and specified the space according to the HyperOpt protocol (and now it works):


import numpy as np
from hyperopt import hp

lr = [5e-7, 1e-3]
weight_decay = [0, 1e-3]
T_0    = [5, 10]
T_mult = [1, 2]

space = {
    "optimization" : hp.choice("optimization",
                                [{"name"         : "AdamW_none",
                                  "lr"           : hp.loguniform('lr_1', np.log(lr[0]), np.log(lr[1])),
                                  "weight_decay" : hp.uniform('weight_decay_1', weight_decay[0], weight_decay[1])},

                                 {"name"         : "AdamW_cosine",
                                  "lr"           : hp.loguniform('lr_2', np.log(lr[0]), np.log(lr[1])),
                                  "weight_decay" : hp.uniform('weight_decay_2', weight_decay[0], weight_decay[1])},

                                 {"name"         : "AdamW_restarts",
                                  "lr"           : hp.loguniform('lr_3', np.log(lr[0]), np.log(lr[1])),
                                  "weight_decay" : hp.uniform('weight_decay_3', weight_decay[0], weight_decay[1]),
                                  "T_0"          : hp.choice("T_0_1", T_0),
                                  "T_mult"       : hp.choice("T_mult_1", T_mult)},

                                 {"name"         : "AdamW_plateau",
                                  "lr"           : hp.loguniform('lr_4', np.log(lr[0]), np.log(lr[1])),
                                  "weight_decay" : hp.uniform('weight_decay_4', weight_decay[0], weight_decay[1]),
                                  "factor"       : hp.uniform("factor_1", 0, 1)},

                                 {"name"         : "SGDW_none",
                                  "lr"           : hp.loguniform('lr_5', np.log(lr[0]), np.log(lr[1])),
                                  "weight_decay" : hp.uniform('weight_decay_5', weight_decay[0], weight_decay[1]),
                                  "momentum"     : hp.uniform("momentum_1", 0, 1),
                                  "dampening"    : hp.uniform("dampening_1", 0, 1)},

                                 {"name"         : "SGDW_cosine",
                                  "lr"           : hp.loguniform('lr_6', np.log(lr[0]), np.log(lr[1])),
                                  "weight_decay" : hp.uniform('weight_decay_6', weight_decay[0], weight_decay[1]),
                                  "momentum"     : hp.uniform("momentum_2", 0, 1),
                                  "dampening"    : hp.uniform("dampening_2", 0, 1)},

                                 {"name"         : "SGDW_restarts",
                                  "lr"           : hp.loguniform('lr_7', np.log(lr[0]), np.log(lr[1])),
                                  "weight_decay" : hp.uniform('weight_decay_7', weight_decay[0], weight_decay[1]),
                                  "momentum"     : hp.uniform("momentum_3", 0, 1),
                                  "dampening"    : hp.uniform("dampening_3", 0, 1),
                                  "T_0"          : hp.choice("T_0_2", T_0),
                                  "T_mult"       : hp.choice("T_mult_2", T_mult)},

                                 {"name"         : "SGDW_plateau",
                                  "lr"           : hp.loguniform('lr_8', np.log(lr[0]), np.log(lr[1])),
                                  "weight_decay" : hp.uniform('weight_decay_8', weight_decay[0], weight_decay[1]),
                                  "momentum"     : hp.uniform("momentum_4", 0, 1),
                                  "dampening"    : hp.uniform("dampening_4", 0, 1),
                                  "factor"       : hp.uniform("factor_2", 0, 1)},

                                 {"name"         : "SGDW_none_nesterov",
                                  "lr"           : hp.loguniform('lr_9', np.log(lr[0]), np.log(lr[1])),
                                  "weight_decay" : hp.uniform('weight_decay_9', weight_decay[0], weight_decay[1]),
                                  "momentum"     : hp.uniform("momentum_5", 0, 1)},

                                 {"name"         : "SGDW_cosine_nesterov",
                                  "lr"           : hp.loguniform('lr_10', np.log(lr[0]), np.log(lr[1])),
                                  "weight_decay" : hp.uniform('weight_decay_10', weight_decay[0], weight_decay[1]),
                                  "momentum"     : hp.uniform("momentum_6", 0, 1)},

                                 {"name"         : "SGDW_restarts_nesterov",
                                  "lr"           : hp.loguniform('lr_11', np.log(lr[0]), np.log(lr[1])),
                                  "weight_decay" : hp.uniform('weight_decay_11', weight_decay[0], weight_decay[1]),
                                  "momentum"     : hp.uniform("momentum_7", 0, 1),
                                  "T_0"          : hp.choice("T_0_3", T_0),
                                  "T_mult"       : hp.choice("T_mult_3", T_mult)},

                                 {"name"         : "SGDW_plateau_nesterov",
                                  "lr"           : hp.loguniform('lr_12', np.log(lr[0]), np.log(lr[1])),
                                  "weight_decay" : hp.uniform('weight_decay_12', weight_decay[0], weight_decay[1]),
                                  "momentum"     : hp.uniform("momentum_8", 0, 1),
                                  "factor"       : hp.uniform("factor_3", 0, 1)}
                                 ])
    }
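
For reference, a space like this would typically be handed to Tune through a HyperOptSearch search algorithm; in this sketch the trainable (training_function), metric, and sample budget are placeholders:

from ray import tune
from ray.tune.suggest.hyperopt import HyperOptSearch

# Placeholders: substitute your own trainable, metric, and budget.
search_alg = HyperOptSearch(space, metric="val_loss", mode="min")
analysis = tune.run(training_function, search_alg=search_alg, num_samples=50)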
sidphbot commented 3 years ago

@krfricke how do you specify a nested parameter name in parameter_columns for the reporter?

config = mycustomconfig() # nested DictConfig for the PL model

Adding the search space:

config['model']['aggregation']['out_channels'] = tune.choice([48, 96])
config['data']['loader']['train_len'] = tune.choice([5000, 50000, 2500])

...

reporter = CLIReporter(
    parameter_columns=["out_channels", "train_len"],  # ???? what to write, will this suffice?
    metric_columns=["val_loss", "epoch"])

krfricke commented 3 years ago

Can you try using / as separators, e.g. model/aggregation/out_channels?
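
For example, a sketch assuming nested config keys are flattened with / separators in Tune's reporting:

from ray.tune import CLIReporter

# Nested keys referenced with "/" separators (assumed flattening convention).
reporter = CLIReporter(
    parameter_columns=["model/aggregation/out_channels",
                       "data/loader/train_len"],
    metric_columns=["val_loss", "epoch"])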