ipums / hlink

Hierarchical record linkage at scale
Mozilla Public License 2.0

Support a randomized parameter search in model exploration #167

Open riley-harper opened 5 days ago

riley-harper commented 5 days ago

Currently there are two ways to generate the list of model (hyper)parameters to search in model exploration. You can either provide an explicit list of the model settings that you would like to test, or you can set param_grid = true and provide a grid of parameters and thresholds to test, like this:

model_parameters = [
  {type = "random_forest", maxDepth = [5, 15, 25], numTrees = [50, 75, 100], threshold = [0.5, 0.6, 0.7], threshold_ratio = [1.0, 1.2, 1.3], minInstancesPerNode = [1, 2]}
]
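For context, the grid strategy tests the Cartesian product of all of the listed values. A minimal sketch of that expansion, using illustrative names rather than hlink's actual internals:

```python
# Sketch of how a parameter grid expands into individual model settings,
# mirroring the idea behind param_grid = true. Names here are illustrative,
# not hlink's actual code.
from itertools import product

grid = {
    "maxDepth": [5, 15, 25],
    "numTrees": [50, 75, 100],
    "minInstancesPerNode": [1, 2],
}

def expand_grid(grid):
    """Return one dict per combination of the grid's values."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

settings = expand_grid(grid)
print(len(settings))  # 3 * 3 * 2 = 18 combinations
```

This combinatorial growth is exactly why a randomized search is attractive: the number of grid points explodes as parameters are added, while num_samples stays fixed.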

We would like to add a third option, randomized parameter search. With this option, users will specify parameters as either a distribution over a range or a list of choices. They'll set a new num_samples configuration setting which tells hlink how many model parameter settings it should sample from the given distributions.

To do this, we'll need to upgrade from the single boolean param_grid flag to something a little more expressive. Maybe a new training.model_parameter_search table would work well:

# Equivalent to param_grid = true. We can still accept param_grid = true but print a
# deprecation message and internally convert it to this representation.
model_parameter_search = {strategy = "grid"}
# Equivalent to param_grid = false. Like param_grid = true, we can accept this but deprecate it.
# In this mode, we just take exactly what's in model_parameters and test it.
# This is still the default.
model_parameter_search = {strategy = "explicit"}
# The new feature.
model_parameter_search = {strategy = "randomized", num_samples = 20}

Users could also write this with TOML's standard (non-inline) table syntax for clarity.

[training.model_parameter_search]
strategy = "randomized"
num_samples = 20
seed = 111
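One way the backwards compatibility could look in code, as a hypothetical sketch (the function and key names here are assumptions, not hlink's actual API):

```python
# Hypothetical sketch of backwards-compatible config handling: convert the
# deprecated param_grid flag into the proposed model_parameter_search table.
import warnings

def resolve_parameter_search(training_config):
    """Return a model_parameter_search table, translating legacy settings."""
    if "model_parameter_search" in training_config:
        return training_config["model_parameter_search"]
    if "param_grid" in training_config:
        warnings.warn(
            "param_grid is deprecated; please use model_parameter_search instead",
            DeprecationWarning,
        )
        strategy = "grid" if training_config["param_grid"] else "explicit"
        return {"strategy": strategy}
    # "explicit" remains the default when neither setting is present
    return {"strategy": "explicit"}

print(resolve_parameter_search({"param_grid": True}))  # {'strategy': 'grid'}
```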

When the strategy is "randomized", each parameter can be either a list of values to sample from (uniformly) or a table defining a distribution and its arguments. We may be able to make good use of scipy.stats here.

[[model_parameters]]
type = "random_forest"
maxDepth = {low = 5, high = 26, distribution = "randint"}
numTrees = {low = 50, high = 101, distribution = "randint"}
minInstancesPerNode = [1, 2]
# Not entirely sure how this will work yet
threshold = {low = 0.5, high = 0.7, distribution = "uniform"}
threshold_ratio = {low = 1.0, high = 1.3, distribution = "uniform"}
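Drawing one model setting from a spec like this could look roughly as follows. scipy.stats would supply the distributions in practice; this dependency-free sketch covers the two distributions in the example with the standard library, and all names are illustrative:

```python
# Sketch of sampling a single model setting from distribution specs like the
# ones above. Not hlink's actual implementation.
import random

def sample_value(spec, rng):
    if isinstance(spec, list):
        return rng.choice(spec)  # uniform choice from a list
    if isinstance(spec, dict):
        if spec["distribution"] == "randint":
            # like scipy.stats.randint: low inclusive, high exclusive
            return rng.randrange(spec["low"], spec["high"])
        if spec["distribution"] == "uniform":
            return rng.uniform(spec["low"], spec["high"])
    return spec  # a fixed scalar value passes through unchanged

spec = {
    "maxDepth": {"low": 5, "high": 26, "distribution": "randint"},
    "minInstancesPerNode": [1, 2],
    "threshold": {"low": 0.5, "high": 0.7, "distribution": "uniform"},
}

rng = random.Random(111)  # seeded, matching the proposed seed setting
setting = {name: sample_value(s, rng) for name, s in spec.items()}
print(setting)
```

Seeding the generator (the proposed seed setting) would make a randomized search reproducible across runs.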

Outstanding questions:

riley-harper commented 5 days ago

Maybe the way to go is to also support the same three options for thresholds with a threshold_search attribute.

riley-harper commented 4 days ago

The last commit makes the randomized search a little more flexible for users by letting them pass particular values, lists to sample from, or dictionaries defining distributions in model_parameters. For example,

[training.model_parameter_search]
strategy = "randomized"
num_samples = 50

[[training.model_parameters]]
type = "random_forest"
# maxDepth is always 7, and impurity is always "entropy"
maxDepth = 7
impurity = "entropy"
# subsamplingRate is sampled from the interval [0.1, 0.9] uniformly
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
# numTrees is randomly sampled from the list 1, 10, 50, 100
numTrees = [1, 10, 50, 100]
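A sketch of how num_samples might interact with these three value kinds, where a scalar stays fixed, a list is sampled uniformly, and a table names a distribution (again an illustration, not hlink's actual code):

```python
# Hedged sketch: generate num_samples model settings from a mixed spec of
# fixed values, choice lists, and distribution tables.
import random

params = {
    "type": "random_forest",
    "maxDepth": 7,
    "impurity": "entropy",
    "subsamplingRate": {"distribution": "uniform", "low": 0.1, "high": 0.9},
    "numTrees": [1, 10, 50, 100],
}

def draw(spec, rng):
    if isinstance(spec, list):
        return rng.choice(spec)
    if isinstance(spec, dict):
        return rng.uniform(spec["low"], spec["high"])  # only "uniform" here
    return spec  # scalars (including the model type) stay fixed

rng = random.Random(0)
num_samples = 50
samples = [{k: draw(v, rng) for k, v in params.items()} for _ in range(num_samples)]
print(len(samples))  # 50 sampled model settings
```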