muupan closed this pull request 3 years ago
/test
Successfully created a job for commit a97428c:
Thank you for this!
I'm definitely for this addition. I think changing examples/gym/train_dqn_gym.py
might cause some confusion for people trying to compare results. Perhaps we could add a comment or warning to your change directing them to this PR for reference? We could make this temporary, with the intention of removing it after the first release that includes this change.
You mean adding a warning to examples/gym/train_dqn_gym.py, correct?
Yes, that's what I meant.
That sounds nice. I will add a warning.
/test
Successfully created a job for commit a3f5ddb:
@ummavi
I resolved the conflicts and added a warning message to train_dqn_gym.py:
(pfrl) ➜ pfrl git:(eval-during-episode) ✗ python examples/gym/train_dqn_gym.py --env CartPole-v0 --gpu -1 --eval-n-runs 10 --eval-interval 1000 --steps 10000
Output files are saved in results/a3f5ddb93eeea71a9941a48978350ddbf1b2d6a9-00000000-17d39bf4
WARNING: Since https://github.com/pfnet/pfrl/pull/112 we have started setting `eval_during_episode=True` in this script, which affects the timings of evaluation phases.
INFO:pfrl.experiments.train_agent:outdir:results/a3f5ddb93eeea71a9941a48978350ddbf1b2d6a9-00000000-17d39bf4 step:13 episode:0 R:0.013000000000000005
INFO:pfrl.experiments.train_agent:statistics:[('average_q', nan), ('average_loss', nan), ('cumulative_steps', 13), ('n_updates', 0), ('rlen', 13)]
INFO:pfrl.experiments.train_agent:outdir:results/a3f5ddb93eeea71a9941a48978350ddbf1b2d6a9-00000000-17d39bf4 step:27 episode:1 R:0.014000000000000005
INFO:pfrl.experiments.train_agent:statistics:[('average_q', nan), ('average_loss', nan), ('cumulative_steps', 27), ('n_updates', 0), ('rlen', 27)]
...
(pfrl) ➜ pfrl git:(eval-during-episode) ✗ cat results/a3f5ddb93eeea71a9941a48978350ddbf1b2d6a9-00000000-17d39bf4/scores.txt
steps episodes elapsed mean median stdev max min average_q average_loss cumulative_steps n_updates rlen
1000 43 2.3906562328338623 9.4 9.0 0.5163977794943222 10.0 9.0 0.11232906 0.01908952370285988 1000 1 1000
2000 86 15.7501220703125 189.7 200.0 32.57145989973431 200.0 97.0 0.5273197 0.0019056166338850744 2000 1001 2000
3000 114 26.208837032318115 76.4 73.5 7.947046970765654 92.0 69.0 0.7134292 0.002025567170785507 3000 2001 3000
4000 143 39.69370198249817 78.6 77.0 9.628660919936433 93.0 65.0 0.8871368 0.0031038593433913774 4000 3001 4000
5000 168 51.10098910331726 80.0 82.5 11.832159566199232 96.0 64.0 1.0027406 0.0035164057277143 5000 4001 5000
6000 188 59.89870023727417 78.1 78.5 4.954235000930461 85.0 72.0 1.1799519 0.0032080489926738665 6000 5001 6000
7000 199 68.57331418991089 198.2 200.0 5.692099788303083 200.0 182.0 1.328854 0.0022577637553331444 7000 6001 7000
8000 210 76.6203100681305 104.7 104.0 5.558776843874918 112.0 97.0 1.4316108 0.002671265401004348 8000 7001 8000
9000 224 84.00722813606262 114.0 114.5 2.581988897471611 117.0 110.0 1.5011693 0.0025710580812301487 9000 8001 9000
10000 237 91.97368311882019 149.9 152.0 13.328749211968674 172.0 133.0 1.4498457 0.002194677170191426 10000 9001 10000
Background
PFRL's `train_agent_with_evaluation` runs evaluation only when a training episode finishes. This is good when `env` and `eval_env` are the same environment, e.g., a real-world robot setup or some external program that is hard to run multiple instances of. However, when you can make an `eval_env` instance that is independent from `env`, there is no good reason to wait until the training episode ends. In fact, `train_agent_batch` does not wait until episodes end, because with multiple training envs, waiting until all episodes end would be inefficient and the logic would be complicated.

The downside of waiting until episodes end is that the actual evaluation intervals do not follow the value of `eval_interval`. Here is `scores.txt` from `python examples/gym/train_dqn_gym.py --env CartPole-v0 --gpu -1 --eval-n-runs 10 --eval-interval 1000 --steps 10000` on `master`; look at the `steps` column. This is annoying when you visualize training curves. I believe that, when we can make independent env instances for training and evaluation, it is better to avoid this.
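To make the drift concrete, here is a small standalone simulation, plain Python rather than pfrl code, of what happens when evaluation is only allowed at episode boundaries: evaluation runs late by up to one episode, so the recorded steps overshoot the multiples of `eval_interval`. The episode-length range below is made up purely for illustration.

```python
# Toy simulation (plain Python, not pfrl) of "evaluate only at episode
# boundaries". Episode lengths are made-up values for illustration.
import random

random.seed(0)
eval_interval = 1000
total_steps = 10000

step = 0
next_eval = eval_interval
eval_steps = []  # steps at which evaluation actually runs

while step < total_steps:
    episode_length = random.randint(50, 200)  # arbitrary episode length
    step += episode_length
    # Evaluation is deferred to the end of the episode that crosses the
    # interval boundary, so it fires late by up to one episode.
    if step >= next_eval:
        eval_steps.append(step)
        next_eval += eval_interval

print(eval_steps)
# The recorded steps overshoot 1000, 2000, ... by up to one episode length,
# which is the drift described above for master's scores.txt.
```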
What this PR does
- Adds an `eval_during_episode` option to `train_agent_with_evaluation` so that users can avoid such waiting and run evaluations on time, even during training episodes (see the usage sketch after this list).
- Adds tests that cover both `eval_during_episode=True` and `eval_during_episode=False`.
- Sets `eval_during_episode=True` in `examples/gym/train_dqn_gym.py` for illustration. As I wrote above, I believe this is a better choice in most cases, but this PR keeps it optional elsewhere.
- Since evaluation can now happen in the middle of a training episode, the values reported by `agent.get_statistics` at evaluation time could change. This is why I made a small change to the existing test case.
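For reference, here is a rough usage sketch of the new option. The agent and environment setup below is a generic quickstart-style DQN configuration, not code from this PR, and the exact keyword arguments accepted by `train_agent_with_evaluation` (in particular whether `eval_n_steps` must be passed) should be checked against the pfrl version you are using.

```python
import gym
import torch
import pfrl

env = gym.make("CartPole-v0")
eval_env = gym.make("CartPole-v0")  # independent env used only for evaluation
obs_size = env.observation_space.low.size
n_actions = env.action_space.n

# Minimal DQN setup, roughly following pfrl's quickstart; details are
# illustrative only.
q_func = torch.nn.Sequential(
    torch.nn.Linear(obs_size, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, n_actions),
    pfrl.q_functions.DiscreteActionValueHead(),
)
optimizer = torch.optim.Adam(q_func.parameters(), lr=1e-3)
replay_buffer = pfrl.replay_buffers.ReplayBuffer(capacity=10 ** 5)
explorer = pfrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.1, random_action_func=env.action_space.sample
)
agent = pfrl.agents.DQN(
    q_func,
    optimizer,
    replay_buffer,
    gamma=0.99,
    explorer=explorer,
    replay_start_size=500,
    target_update_interval=100,
    gpu=-1,  # CPU
)

pfrl.experiments.train_agent_with_evaluation(
    agent=agent,
    env=env,
    eval_env=eval_env,
    steps=10000,
    eval_n_steps=None,  # evaluate for a fixed number of episodes, not steps
    eval_n_runs=10,
    eval_interval=1000,
    outdir="results",
    eval_during_episode=True,  # the option added by this PR
)
```

With `eval_during_episode=False` (the default), nothing changes and evaluations still wait for the current training episode to finish.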
Here is `scores.txt` from the same command after the change.