[tune] Use Aggregated metric for tuning across seeds

anqixu commented 4 years ago

Not sure if this is already a feature or not, please forgive and provide insight :)

While I haven't tried yet, I understand that tune has support for search algorithms (like BO, spearmint, etc.), which decide on new hparam settings to try based on the performance of previous hparam trials.

It is known that modern RL agents tend to be very sensitive to their initial random number generator (RNG) seed, so results can depend heavily on it.

I would like to spawn multiple Tune trials of the same hparam setting for an RL run, but with different RNG seeds (either explicit or null-implicit). I can do this explicitly already. However, what I don't know is whether the following is implemented / possible:

1. wait for all RNG-seed-repeated trials to complete
2. compute an aggregate statistic (e.g. mean) over the repeated trials as the performance metric corresponding to the hparam values, and pass this aggregated performance output to Tune's search algorithm, to decide the next hparam settings to try
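
The two steps above can be sketched without any Tune API at all. The following is only an illustration of the idea, not Tune's implementation: `train_once`, `aggregated_score`, and `random_search` are hypothetical names, the objective is a toy function, and plain random search stands in for a real search algorithm. The point is that the searcher only ever sees the seed-averaged score.

```python
import random
import statistics


def train_once(config, seed):
    """Hypothetical stand-in for one RL run; returns a noisy score."""
    rng = random.Random(seed)
    # Toy objective: best at lr = 0.1, plus seed-dependent noise.
    return -(config["lr"] - 0.1) ** 2 + rng.gauss(0.0, 0.01)


def aggregated_score(config, seeds):
    """Run the same config once per seed and return the mean score."""
    scores = [train_once(config, seed) for seed in seeds]
    return statistics.mean(scores)


def random_search(num_configs, seeds, search_seed=0):
    """Toy random search that is only shown the seed-averaged score."""
    rng = random.Random(search_seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_configs):
        config = {"lr": rng.uniform(0.0, 0.5)}
        # Every seed finishes before the score is reported back.
        score = aggregated_score(config, seeds)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

For example, `random_search(num_configs=20, seeds=[0, 1, 2])` evaluates 20 sampled configurations with three seeds each and returns the configuration with the best mean score.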

P.S.: I heard that maybe num_samples can be used that way, but I'm not sure that's valid, since when I used num_samples, the hparams for each trial were sampled independently rather than simply repeated.

Many thanks!

floringogianu commented 4 years ago

Yes, this seems to be a big problem for the usability of these methods. Here's an example of hyperparameter configurations found by Tune on the pybullet inverted pendulum env:

And here is the actual performance of these hyperparameters when training ten different seeds:

The agent is an n-step Actor-Critic; I used HyperOptSearch as the search algorithm and ASHAScheduler for early stopping, with 512 trials.

richardliaw commented 4 years ago

Sorry for the late response; this got deprioritized over the last two weeks, but I will aim to merge by the end of the week!

floringogianu commented 4 years ago

Just wanted to say thanks for this. I am currently experimenting with it.

I also worked on manually orchestrating the launch of N separate seeds for K steps from within each Tune trial, calling get() on each and taking the mean. This has the advantage of also working with early stopping and other schedulers.

But for some reason Ray stops allocating new processes after the first "batch" (max_concurrent workers times the number of seeds), without raising any errors. It soon grinds to a halt, running on only one or two cores and never launching the rest of the trials. Hoping to fix this and do a comparison.
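
As a rough picture of the manual orchestration described above, here is a minimal sketch. It is not the code from this comment: a standard-library thread pool stands in for Ray remote tasks, `future.result()` plays the role of get(), and `run_seed` and `trial` are hypothetical names with a toy training loop.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def run_seed(config, seed, num_steps):
    """Hypothetical stand-in for training one seed for num_steps steps."""
    rng = random.Random(seed)
    score = 0.0
    for _ in range(num_steps):
        score += config["lr"] * rng.random()
    return score


def trial(config, seeds, num_steps):
    """One trial: launch every seed in parallel, then return the mean."""
    with ThreadPoolExecutor(max_workers=len(seeds)) as pool:
        futures = [pool.submit(run_seed, config, s, num_steps) for s in seeds]
        # result() blocks until that seed finishes, like get() on a task.
        scores = [f.result() for f in futures]
    return statistics.mean(scores)
```

Because the mean is produced inside the trial, a scheduler could in principle be given one aggregated value every K steps by calling `trial` repeatedly, which is what would make this approach compatible with early stopping.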

richardliaw commented 4 years ago

@floringogianu that seems like a bug. Can you post a small script for reproducing this?

Feel free to open a new issue.

ray-project/ray #6994