Status: Open · riley-harper opened this issue 5 days ago
Maybe the way to go is to also support the same three options (an explicit list of models, a grid, and randomized search) for thresholds, with a `threshold_search` attribute.
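A sketch of what that might look like, mirroring the `model_parameter_search` table discussed below (all of these names are hypothetical):

```toml
[training]
# Hypothetical: the threshold is drawn from a distribution instead of being fixed
threshold = {distribution = "uniform", low = 0.7, high = 0.9}

[training.threshold_search]
strategy = "randomized"
num_samples = 10
```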
---

The last commit makes the randomized search a little more flexible for users by letting them pass particular values, lists to sample from, or dictionaries defining distributions in `model_parameters`. For example:
```toml
[training.model_parameter_search]
strategy = "randomized"
num_samples = 50

[[training.model_parameters]]
type = "random_forest"
# maxDepth is always 7, and impurity is always "entropy"
maxDepth = 7
impurity = "entropy"
# subsamplingRate is sampled from the interval [0.1, 0.9] uniformly
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
# numTrees is randomly sampled from the list 1, 10, 50, 100
numTrees = [1, 10, 50, 100]
```
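If I'm reading this correctly, each of the 50 sampled settings keeps maxDepth and impurity fixed while drawing fresh values for subsamplingRate and numTrees.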
---

Currently there are two ways to generate the list of model (hyper)parameters to search in model exploration. You can either provide a list of all of the models that you would like to test, or you can set `param_grid = true` and provide a grid of parameters and thresholds to test, along the lines of the sketch below.
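A hypothetical grid config (key names are borrowed from the random forest example in the comment above; the issue's original example may have differed):

```toml
[training]
param_grid = true

[[training.model_parameters]]
type = "random_forest"
# The grid is the cross product: 3 maxDepth values x 4 numTrees values
# = 12 parameter settings, each tested at every threshold
maxDepth = [3, 7, 10]
numTrees = [1, 10, 50, 100]
# Illustrative key name for the thresholds to test
threshold = [0.8, 0.9]
```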
We would like to add a third option, randomized parameter search. With this option, users will specify parameters as either a distribution over a range or a list of choices. They'll set a new `num_samples` configuration setting which tells hlink how many model parameter settings it should sample from the given distributions.

To do this, we'll need to upgrade from a single `param_grid: bool` flag to something a little more complex. Maybe a new `training.model_parameter_search` table would work well. Users could also write this with the other table syntax for clarity; sketches of both spellings follow.
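A sketch of the proposed table, reusing the strategy and num_samples keys from the comment above (the "explicit" and "grid" strategy names are my assumption for the other two options):

```toml
[training.model_parameter_search]
# One of "explicit", "grid", or "randomized" (names assumed)
strategy = "randomized"
num_samples = 50
```

The same setting written with TOML's inline table syntax:

```toml
[training]
model_parameter_search = {strategy = "randomized", num_samples = 50}
```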
When the strategy is "randomized", parameters can either be a list of values to be sampled from (uniformly) or a table which defines a distribution and arguments for the distribution. We may be able to make good use of `scipy.stats` here.
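As a rough sketch of that idea (a hypothetical function, not hlink's actual implementation; only the "uniform" distribution appears in the comment above, and "randint" is my assumption for an integer-valued analogue):

```python
import random

import scipy.stats


def sample_setting(model_parameters: dict) -> dict:
    """Draw one concrete parameter setting from a parsed TOML table."""
    setting = {}
    for name, value in model_parameters.items():
        if isinstance(value, dict):
            # e.g. {distribution = "uniform", low = 0.1, high = 0.9}
            if value["distribution"] == "uniform":
                low, high = value["low"], value["high"]
                # scipy.stats.uniform covers [loc, loc + scale]
                setting[name] = scipy.stats.uniform(loc=low, scale=high - low).rvs()
            elif value["distribution"] == "randint":
                # integers drawn from [low, high], inclusive on both ends
                setting[name] = scipy.stats.randint(value["low"], value["high"] + 1).rvs()
            else:
                raise ValueError(f"unknown distribution: {value['distribution']}")
        elif isinstance(value, list):
            # a list of choices: sample one uniformly
            setting[name] = random.choice(value)
        else:
            # a plain value is fixed across all samples
            setting[name] = value
    return setting


# The random_forest example above would yield settings like
# {"type": "random_forest", "maxDepth": 7, "impurity": "entropy",
#  "subsamplingRate": 0.53, "numTrees": 50}
```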
Outstanding questions:

- Should there also be a `num_threshold_samples` setting to support randomized parameter search on thresholds?