neuroevolution-ai / NeuroEvolution-CTRNN_new

MIT License

option for evaluation runs for statistic #39

Open bjuergens opened 3 years ago

bjuergens commented 3 years ago

goal: have an option to define extra evaluation runs for the best individual of each generation. The score of these extra evaluations is then added to the stats. It is also used for the HoF.

Background: This will make the learning curve of our experiments much more comparable to related work (iirc they did it like this in the back2basic paper).

todo

pdeubel commented 3 years ago

I would add the option to choose a number of random seeds to evaluate the best individual on different seeds of the environment. These seeds would be the same throughout an experiment, so that each best individual of a generation is evaluated multiple times, each time with a different seed, but in the next generation the same seeds as in the previous generation are used.

This would measure the generalization of the best individual per generation on different states of the environment. A learning curve could then be built by using the mean and standard deviation from the multiple evaluations per generation.
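A minimal sketch of that idea, assuming a hypothetical evaluate_individual(individual, seed) helper that runs one episode on the given env seed and returns its reward (neither the helper nor the constant below exist in the repo, they are just for illustration):

import statistics

VALIDATION_SEEDS = list(range(10))  # chosen once, reused unchanged in every generation

def validation_stats(best_individual, evaluate_individual):
    # same seeds in every generation, so the per-generation numbers are directly comparable
    rewards = [evaluate_individual(best_individual, seed) for seed in VALIDATION_SEEDS]
    return statistics.mean(rewards), statistics.stdev(rewards)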

bjuergens commented 3 years ago

First of all: We should establish names for the concepts here.

I think for starters the following sub-config should be added to the episoderunner conf

from typing import Optional

class EvaluationConf:
    number_fitness_runs: Optional[int] = None
    max_steps_per_run: Optional[int] = None

All parameters should be optional. If they exist, they override the values for the test run, otherwise the values from the "regular" eprunner are used.
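A minimal sketch of that override behaviour; evaluation_conf and ep_runner_conf are just stand-in names for the two config objects:

# fall back to the regular episode runner values when the evaluation conf leaves them unset
number_fitness_runs = (evaluation_conf.number_fitness_runs
                       if evaluation_conf.number_fitness_runs is not None
                       else ep_runner_conf.number_fitness_runs)
max_steps_per_run = (evaluation_conf.max_steps_per_run
                     if evaluation_conf.max_steps_per_run is not None
                     else ep_runner_conf.max_steps_per_run)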

Only the best individual of a generation shall be re-evaluated

The seeds shall be fixed to range(number_fitness_runs)

Things to add later:

pdeubel commented 3 years ago

How should we call these special evaluation runs?

Difficult 😄 A synonym for evaluation would be assessment. Other than that, maybe something like evaluation after training or evaluation for paper. But I am not very convinced by these names.

All parameters should be optional. If they exist, they override the values for the test run, otherwise the values from the "regular" eprunner are used.

I agree for max_steps_per_run, but for number_fitness_runs I would not necessarily. Suppose the original experiment used only one for number_fitness_runs; then we won't evaluate on different seeds unless we explicitly set the parameter higher. If possible, I would add a minimum of 5 for it.

environment_attributes

Why would we need that? This would be in the original config and should not be changed for these evaluation purposes, right?

bjuergens commented 3 years ago

then we won't evaluate on different seeds unless we explicitly set the parameter higher

I disagree.

All special evaluation runs should be completely separate from the regular training-evaluation runs. I think in the future we will even have special evaluation runs which run on different gyms than the regular evaluation runs (e.g. when we train using our own procgen env, but use the canonical procgen for the special evaluation).

How should we call these special evaluation runs?

We could use the canonical vocabulary from https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets maybe?

Then the regular evaluation runs would be called "training episodes". The new special evaluation runs would be called "validation episodes". And we may even add "test episodes" in the future, which we run after the experiments to compare the fully trained individuals between different experiments.

@pdeubel @DanielZim Do you agree with the names "training episodes" and "validation episodes"?


For the random seeds for the special evaluation, I suggest we add an option first_validation_seed, which defaults to 1. The seeds used for validation are then just the numbers from first_validation_seed to first_validation_seed + number_fitness_runs, in that order.
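A minimal sketch of that seed scheme (the function is just for illustration; first_validation_seed and number_fitness_runs are the option names proposed above):

def validation_seeds(first_validation_seed: int, number_fitness_runs: int):
    # e.g. first_validation_seed=1, number_fitness_runs=5 -> [1, 2, 3, 4, 5]
    return list(range(first_validation_seed, first_validation_seed + number_fitness_runs))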

bjuergens commented 3 years ago

environment_attributes

Why would we need that? This would be in the original config and should not be changed for these evaluation purposes, right?

right now we don't. In the future we may want to use different difficulties for the validation episodes.

For instance: heist has an internal value "difficulty", which is a number between 1 and 10 (or so) that affects the number of keys and the size of the maze. In the canonical heist environment the difficulty is randomly selected between 4 and 10 (I think). In our custom proc-gen branch we could set the minimum difficulty via environment_attributes (assuming someone has implemented that).

We could do the following experiment: train with a low minimum difficulty, and validate on the canonical (harder) difficulty range. Then we can look at the training curve and observe how training on simpler envs affects performance on more difficult envs.
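Purely as an illustration of that setup (the environment_attributes field and the difficulty keys below are hypothetical, nothing here is implemented):

# hypothetical training conf: restrict heist to easy mazes
training_environment_attributes = {"difficulty_min": 1, "difficulty_max": 3}
# hypothetical validation conf: validate on the canonical, harder difficulty range
validation_environment_attributes = {"difficulty_min": 4, "difficulty_max": 10}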

Or we could train on the memory-reacher, and use different values for the memory phase.


But: For now we do not need this. It is only a nice-to-have for the future.

pdeubel commented 3 years ago

I disagree.

I do not disagree with that, quite the opposite: in my second comment I proposed to explicitly use new runs with the individuals. My point is more a detail of the implementation. Example: when the original experiment has, say, number_fitness_runs=1 and we then want to run the special evaluations but do not specify number_fitness_runs in the EvaluationConf (for example because we forgot), only one episode per generation will be done, because we did not explicitly overwrite the parameter. This is undesired behavior. So I would suggest that for the special evaluations number_fitness_runs must be at least 5, so that we have a good measure on different environment seeds.
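A minimal sketch of how such a lower bound could be enforced when the config is loaded (the constant and function names are made up here):

MIN_VALIDATION_RUNS = 5

def check_evaluation_conf(conf):
    # refuse configs that would evaluate the best individual on fewer than 5 environment seeds
    if conf.number_fitness_runs is None or conf.number_fitness_runs < MIN_VALIDATION_RUNS:
        raise ValueError(
            f"number_fitness_runs for the special evaluations must be at least {MIN_VALIDATION_RUNS}")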

The new special evaluation runs would be called "validation episodes"

Wouldn't they be called testing episodes? I think the validation episodes are done during training to get a sense of the non-training performance, but the testing set is not touched until training is completely over. So in our case the training is over and these would then be testing episodes. But I very much like the names 👍🏽

right now we don't. In the future we may want to use different difficulties for the validation episodes.

Yep ok makes sense.

But: For now we do not need this. It is only a nice-to-have for the future.

Yes I agree

bjuergens commented 3 years ago

ah, ok.

So I would suggest that for the special evaluations number_fitness_runs must be at least 5, so that we have a good measure on different environment seeds.

how about we just don't allow a default parameter for this? Then the user is forced to explicitly set this variable if they want to use validation runs.
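A minimal sketch of that, assuming the EvaluationConf from above were turned into a dataclass: a field without a default must be set explicitly, otherwise constructing the config fails.

from dataclasses import dataclass
from typing import Optional

@dataclass
class EvaluationConf:
    number_fitness_runs: int          # no default: the user must set it to use validation runs
    max_steps_per_run: Optional[int] = None

# EvaluationConf(max_steps_per_run=1000)  -> TypeError, number_fitness_runs is missing
# EvaluationConf(number_fitness_runs=10)  -> ok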

Wouldn't they be called testing episodes?

yes, i think? I don't know.

For clarification: I think "testing episodes" happen after the experiment is done, and "validation episodes" happen in each generation during the experiment.

DanielZim commented 3 years ago

then we won't evaluate on different seeds unless we explicitly set the parameter higher

I disagree.

All special evaluation runs should be completely separate from the regular training-evaluation runs. I think in the future we will even have special evaluation runs which run on different gyms than the regular evaluation runs (e.g. when we train using our own procgen env, but use the canonical procgen for the special evaluation).

How should we call these special evaluation runs?

We could use the canonical vocabulary from https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets maybe?

Then the regular evaluation runs would be called "training episodes". The new special evaluation runs would be called "validation episodes". And we may even add "test episodes" in the future, which we run after the experiments to compare the fully trained individuals between different experiments.

@pdeubel @DanielZim Do you agree with the names "training episodes" and "validation episodes"?

For the random seeds for the special evaluation, I suggest we add an option first_validation_seed, which defaults to 1. The seeds used for validation are then just the numbers from first_validation_seed to first_validation_seed + number_fitness_runs, in that order.

I guess the training, validation and test data sets that are normally used do not 100 % fit into our scheme:

For training I agree, we can call these training episodes. The name validation episodes is also fine; these should be env seeds that are not contained in the training env seeds. However, the test episodes are a little bit different in my opinion. In an ML context, test data is used after hyperparameter optimization, so there should be no tuning of the hyperparameters after using the test data. In our experiments, the mentioned test episodes after the experiment is finished are also validation episodes in my opinion. I rarely (never?) saw the use of test data in a reinforcement learning context. In my opinion, we do not need the classical test data episodes after hyperparameter optimization.

On the other side, I think the validation runs are very important for our experiments, especially to get a meaningful HoF. As you mentioned, we should evaluate the fitness of the best individual in each generation on X fixed seeds (maybe 0 to 9) and

a) add this information to the output log to get a better estimation of the training process
b) add this individual to the HoF if it belongs to the best N overall individuals (see the sketch below)
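For b), a minimal sketch of the HoF update, assuming the mean validation reward is used as the ranking score and the HoF is kept as a plain list of (score, individual) tuples (both are assumptions, not how the repo actually stores its HoF):

HOF_SIZE = 10  # hypothetical N

def update_hof(hall_of_fame, mean_validation_reward, individual):
    # keep only the best N individuals seen so far, ranked by validation score
    hall_of_fame.append((mean_validation_reward, individual))
    hall_of_fame.sort(key=lambda entry: entry[0], reverse=True)
    del hall_of_fame[HOF_SIZE:]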

Maybe not extremely important, but I would prefer starting with the first seed = 0 (not 1). An undefined seed would be seed = -1. I guess the gym env seeds also start with seed = 0.
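Under that convention, the earlier sketches would only change in their defaults (UNDEFINED_SEED is a made-up name for illustration):

from dataclasses import dataclass

UNDEFINED_SEED = -1  # convention for "no seed defined"

@dataclass
class EvaluationConf:
    number_fitness_runs: int        # still required, no default
    first_validation_seed: int = 0  # validation seeds then run from 0 upwards, matching gym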