ipums / hlink

Hierarchical record linkage at scale
Mozilla Public License 2.0

Support a randomized parameter search in model exploration #167

Open riley-harper opened 5 days ago

riley-harper commented 5 days ago

Currently there are two ways to generate the list of model (hyper)parameters to search in model exploration. You can either provide an explicit list of the model settings that you would like to test, or you can set param_grid = true and provide a grid of parameters and thresholds to test, like this:

model_parameters = [
  {type = "random_forest", maxDepth = [5, 15, 25], numTrees = [50, 75, 100], threshold = [0.5, 0.6, 0.7], threshold_ratio = [1.0, 1.2, 1.3], minInstancesPerNode = [1, 2]}
]
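For context, the grid strategy tests the Cartesian product of all of the listed values. A minimal sketch of that expansion, using illustrative names rather than hlink's actual internals:

```python
# Sketch of how a parameter grid expands into individual model settings,
# mirroring the idea behind param_grid = true. Names here are illustrative,
# not hlink's actual code.
from itertools import product

grid = {
    "maxDepth": [5, 15, 25],
    "numTrees": [50, 75, 100],
    "minInstancesPerNode": [1, 2],
}

def expand_grid(grid):
    """Return one dict per combination of the grid's values."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

settings = expand_grid(grid)
print(len(settings))  # 3 * 3 * 2 = 18 combinations
```

This combinatorial growth is exactly why a randomized search is attractive: the number of grid points explodes as parameters are added, while num_samples stays fixed.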

We would like to add a third option, randomized parameter search. With this option, users will specify parameters as either a distribution over a range or a list of choices. They'll set a new num_samples configuration setting which tells hlink how many model parameter settings it should sample from the given distributions.

To do this, we'll need to upgrade from the single boolean param_grid flag to something a little more expressive. Maybe a new training.model_parameter_search table would work well:

# Equivalent to param_grid = true. We can still accept param_grid = true but print a
# deprecation message and internally convert it to this representation.
model_parameter_search = {strategy = "grid"}
# Equivalent to param_grid = false. Like param_grid = true, we can accept this but deprecate it.
# In this mode, we just take exactly what's in model_parameters and test it.
# This is still the default.
model_parameter_search = {strategy = "explicit"}
# The new feature.
model_parameter_search = {strategy = "randomized", num_samples = 20}

Users could also write this with TOML's standard (non-inline) table syntax for clarity.

[training.model_parameter_search]
strategy = "randomized"
num_samples = 20
seed = 111
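One way the backwards compatibility could look in code, as a hypothetical sketch (the function and key names here are assumptions, not hlink's actual API):

```python
# Hypothetical sketch of backwards-compatible config handling: convert the
# deprecated param_grid flag into the proposed model_parameter_search table.
import warnings

def resolve_parameter_search(training_config):
    """Return a model_parameter_search table, translating legacy settings."""
    if "model_parameter_search" in training_config:
        return training_config["model_parameter_search"]
    if "param_grid" in training_config:
        warnings.warn(
            "param_grid is deprecated; please use model_parameter_search instead",
            DeprecationWarning,
        )
        strategy = "grid" if training_config["param_grid"] else "explicit"
        return {"strategy": strategy}
    # "explicit" remains the default when neither setting is present
    return {"strategy": "explicit"}

print(resolve_parameter_search({"param_grid": True}))  # {'strategy': 'grid'}
```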

When the strategy is "randomized", each parameter can be either a list of values to sample from (uniformly) or a table defining a distribution and its arguments. We may be able to make good use of scipy.stats here.

[[model_parameters]]
type = "random_forest"
maxDepth = {low = 5, high = 26, distribution = "randint"}
numTrees = {low = 50, high = 101, distribution = "randint"}
minInstancesPerNode = [1, 2]
# Not entirely sure how this will work yet
threshold = {low = 0.5, high = 0.7, distribution = "uniform"}
threshold_ratio = {low = 1.0, high = 1.3, distribution = "uniform"}
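Drawing one model setting from a spec like this could look roughly as follows. scipy.stats would supply the distributions in practice; this dependency-free sketch covers the two distributions in the example with the standard library, and all names are illustrative:

```python
# Sketch of sampling a single model setting from distribution specs like the
# ones above. Not hlink's actual implementation.
import random

def sample_value(spec, rng):
    if isinstance(spec, list):
        return rng.choice(spec)  # uniform choice from a list
    if isinstance(spec, dict):
        if spec["distribution"] == "randint":
            # like scipy.stats.randint: low inclusive, high exclusive
            return rng.randrange(spec["low"], spec["high"])
        if spec["distribution"] == "uniform":
            return rng.uniform(spec["low"], spec["high"])
    return spec  # a fixed scalar value passes through unchanged

spec = {
    "maxDepth": {"low": 5, "high": 26, "distribution": "randint"},
    "minInstancesPerNode": [1, 2],
    "threshold": {"low": 0.5, "high": 0.7, "distribution": "uniform"},
}

rng = random.Random(111)  # seeded, matching the proposed seed setting
setting = {name: sample_value(s, rng) for name, s in spec.items()}
print(setting)
```

Seeding the generator (the proposed seed setting) would make a randomized search reproducible across runs.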

Outstanding questions:

riley-harper commented 5 days ago

Maybe the way to go is to also support the same three options for thresholds with a threshold_search attribute.

riley-harper commented 4 days ago

The last commit makes the randomized search a little more flexible for users by letting them pass particular values, lists to sample from, or dictionaries defining distributions in model_parameters. For example,

[training.model_parameter_search]
strategy = "randomized"
num_samples = 50

[[training.model_parameters]]
type = "random_forest"
# maxDepth is always 7, and impurity is always "entropy"
maxDepth = 7
impurity = "entropy"
# subsamplingRate is sampled from the interval [0.1, 0.9] uniformly
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
# numTrees is randomly sampled from the list 1, 10, 50, 100
numTrees = [1, 10, 50, 100]
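A sketch of how num_samples might interact with these three value kinds, where a scalar stays fixed, a list is sampled uniformly, and a table names a distribution (again an illustration, not hlink's actual code):

```python
# Hedged sketch: generate num_samples model settings from a mixed spec of
# fixed values, choice lists, and distribution tables.
import random

params = {
    "type": "random_forest",
    "maxDepth": 7,
    "impurity": "entropy",
    "subsamplingRate": {"distribution": "uniform", "low": 0.1, "high": 0.9},
    "numTrees": [1, 10, 50, 100],
}

def draw(spec, rng):
    if isinstance(spec, list):
        return rng.choice(spec)
    if isinstance(spec, dict):
        return rng.uniform(spec["low"], spec["high"])  # only "uniform" here
    return spec  # scalars (including the model type) stay fixed

rng = random.Random(0)
num_samples = 50
samples = [{k: draw(v, rng) for k, v in params.items()} for _ in range(num_samples)]
print(len(samples))  # 50 sampled model settings
```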