pfnet / pfrl

PFRL: a PyTorch-based deep reinforcement learning library
MIT License

Evaluation during training episodes #112

Closed: muupan closed this 3 years ago

muupan commented 3 years ago

Background

PFRL's train_agent_with_evaluation runs evaluation only when training episodes finish. This is fine when env and eval_env are the same environment, e.g., a real-world robot setup or some external program that is hard to run multiple instances of. However, when you can make an eval_env instance that is independent of env, there is no good reason to wait until a training episode ends.

In fact, train_agent_batch does not wait for episodes to end: with multiple training envs, waiting for all of them to finish their episodes would be inefficient and the logic would be complicated.
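(Not part of this PR, just a minimal sketch of the setting being discussed: a small DQN on CartPole-v0 with a separate eval_env passed to train_agent_with_evaluation. Hyperparameters are placeholders chosen only for illustration, and the keyword names assume the usual pfrl APIs.)

```python
import gym
import numpy as np
import torch
import pfrl

# Two independent environment instances: one for training, one for evaluation.
env = gym.make("CartPole-v0")
eval_env = gym.make("CartPole-v0")

obs_size = env.observation_space.low.size
n_actions = env.action_space.n
q_func = pfrl.q_functions.FCStateQFunctionWithDiscreteAction(
    obs_size, n_actions, n_hidden_channels=50, n_hidden_layers=1
)
agent = pfrl.agents.DQN(
    q_func,
    torch.optim.Adam(q_func.parameters()),
    pfrl.replay_buffers.ReplayBuffer(capacity=10 ** 4),
    gamma=0.99,
    explorer=pfrl.explorers.ConstantEpsilonGreedy(
        epsilon=0.1, random_action_func=env.action_space.sample
    ),
    replay_start_size=500,
    target_update_interval=100,
    phi=lambda x: x.astype(np.float32, copy=False),
)

# On master, evaluation runs only after the training episode that crosses each
# eval_interval boundary finishes, so the logged step counts drift past 1000, 2000, ...
pfrl.experiments.train_agent_with_evaluation(
    agent=agent,
    env=env,
    eval_env=eval_env,
    steps=10000,
    eval_n_steps=None,
    eval_n_episodes=10,
    eval_interval=1000,
    outdir="results",
)
```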

The downside of waiting until episodes end is that the actual evaluation intervals do not follow the value of eval_interval. Here is scores.txt from python examples/gym/train_dqn_gym.py --env CartPole-v0 --gpu -1 --eval-n-runs 10 --eval-interval 1000 --steps 10000 on master. Look at the steps column.

steps   episodes    elapsed mean    median  stdev   max min average_q   average_loss    cumulative_steps    n_updates   rlen
1021    49  0.46405887603759766 8.9 9.0 0.8755950357709131  10.0    8.0 0.30439645  0.0225488511972468  1021    22  1021
2026    88  3.4596009254455566  195.9   200.0   12.965338406690355  200.0   159.0   0.7258416   0.002005111983162351    2026    1027    2026
3008    117 6.283432960510254   124.4   125.5   13.696714934611146  140.0   100.0   1.0374576   0.0035956327240273824   3008    2009    3008
4021    151 8.994767904281616   93.3    93.0    8.59004591890456    107.0   75.0    1.1364052   0.002998790842539165    4021    3022    4021
5025    176 11.687598943710327  82.6    83.5    3.687817782917155   87.0    75.0    1.3751132   0.003210274265729822    5025    4026    5025
6045    195 14.706429958343506  156.6   154.0   17.225304383699903  200.0   141.0   1.6338439   0.005055326806614175    6045    5046    6045
7081    208 17.660122871398926  134.6   135.0   4.501851470969102   141.0   127.0   1.7221845   0.005496036116965115    7081    6082    7081
8004    218 20.26002287864685   114.5   114.0   2.9533408577782247  119.0   111.0   1.7510117   0.004004615361336619    8004    7005    8004
9023    232 23.045633792877197  105.2   104.5   4.871686908385363   112.0   99.0    1.958236    0.004077585676568561    9023    8024    9023
10000   241 25.780963897705078  146.7   148.0   8.857514073122072   160.0   135.0   1.9815891   0.0058340844104532155   10000   9001    10000

This is annoying when you visualize training curves. I believe that, when we can make independent env instances for training and evaluation, it is better to avoid this.
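(As an aside on why the drifting steps values are inconvenient: the steps column is what typically goes on the x-axis when plotting training curves. A hypothetical plotting snippet, not part of this PR; the path "results/scores.txt" is a placeholder.)

```python
# Hypothetical snippet: plot the mean evaluation return against training steps.
# When the steps values deviate from the eval_interval grid, curves from
# different runs no longer share a common x-axis.
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.read_csv("results/scores.txt", delim_whitespace=True)
plt.plot(scores["steps"], scores["mean"])
plt.xlabel("steps")
plt.ylabel("mean evaluation return")
plt.savefig("training_curve.png")
```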

What this PR does

Here is scores.txt from the same command after the change. Evaluation now happens exactly every 1000 steps (see the sketch after the table for how this is enabled).

steps   episodes    elapsed mean    median  stdev   max min average_q   average_loss    cumulative_steps    n_updates   rlen
1000    48  0.4554009437561035  10.2    9.5 2.2997584414213788  16.0    8.0 0.17600906  0.032559558749198914    1000    1   1000
2000    91  3.330997943878174   90.1    80.0    36.73160915493781   173.0   53.0    1.2416949   0.004746826794289518    2000    1001    2000
3000    123 6.097692251205444   97.7    97.0    5.696977756280567   108.0   92.0    1.6330086   0.00885181687597651 3000    2001    3000
4000    140 8.853048086166382   148.8   148.5   4.1041983924323695  155.0   143.0   1.9203513   0.007965251127025112    4000    3001    4000
5000    166 11.703951120376587  144.2   142.5   6.762642481555071   159.0   137.0   2.1411452   0.010642013618489727    5000    4001    5000
6000    184 14.453600883483887  48.0    46.5    6.236095644623235   63.0    41.0    2.08529 0.008300355058163405    6000    5001    6000
7000    198 17.330674171447754  177.9   173.5   20.206984491066997  200.0   155.0   2.1882925   0.009315705448389054    7000    6001    7000
8000    204 20.148621082305908  142.9   142.0   6.6907896893167 160.0   136.0   2.4848661   0.009241904776426963    8000    7001    8000
9000    213 22.904212951660156  111.5   112.0   2.9907264074877267  116.0   106.0   2.7800713   0.015680431808577852    9000    8001    9000
10000   223 25.65988802909851   120.8   121.5   3.011090610836324   125.0   116.0   3.0515478   0.018133082903223113    10000   9001    10000
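(Based on the warning message added to train_dqn_gym.py, quoted further down in this thread, the new behavior appears to be opted into via an eval_during_episode argument. A sketch only, continuing the setup from the earlier snippet.)

```python
# Sketch: same call as before, but asking for evaluation to be triggered exactly
# every eval_interval steps, even in the middle of a training episode. The flag
# name follows the warning message quoted later in this thread.
pfrl.experiments.train_agent_with_evaluation(
    agent=agent,
    env=env,
    eval_env=eval_env,
    steps=10000,
    eval_n_steps=None,
    eval_n_episodes=10,
    eval_interval=1000,
    outdir="results",
    eval_during_episode=True,
)
```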
muupan commented 3 years ago

/test

pfn-ci-bot commented 3 years ago

Successfully created a job for commit a97428c:

ummavi commented 3 years ago

Thank you for this! I'm definitely for this addition. However, I think changing examples/gym/train_dqn_gym.py might cause some confusion for people trying to compare results. Perhaps we could add a comment or warning to your change directing them to this PR for reference? We could make it temporary, with the intention of removing it in the release after this gets merged.

muupan commented 3 years ago

You mean adding a warning to examples/gym/train_dqn_gym.py, correct?

ummavi commented 3 years ago

Yes, that's what I meant.

muupan commented 3 years ago

That sounds nice. I will add a warning.
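(The thread does not show how the warning is emitted; judging from the console output quoted below, it looks like a plain message printed at startup. A sketch under that assumption:)

```python
# Sketch: how the notice might be emitted at the top of examples/gym/train_dqn_gym.py.
# The exact mechanism is a guess; the message itself appears in the run output below.
print(
    "WARNING: Since https://github.com/pfnet/pfrl/pull/112 we have started setting "
    "`eval_during_episode=True` in this script, which affects the timings of "
    "evaluation phases."
)
```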

muupan commented 3 years ago

/test

pfn-ci-bot commented 3 years ago

Successfully created a job for commit a3f5ddb:

muupan commented 3 years ago

@ummavi

I resolved the conflicts and added a warning message to train_dqn_gym.py:

(pfrl) ➜  pfrl git:(eval-during-episode) ✗ python examples/gym/train_dqn_gym.py --env CartPole-v0 --gpu -1 --eval-n-runs 10 --eval-interval 1000 --steps 10000
Output files are saved in results/a3f5ddb93eeea71a9941a48978350ddbf1b2d6a9-00000000-17d39bf4
WARNING: Since https://github.com/pfnet/pfrl/pull/112 we have started setting `eval_during_episode=True` in this script, which affects the timings of evaluation phases.
INFO:pfrl.experiments.train_agent:outdir:results/a3f5ddb93eeea71a9941a48978350ddbf1b2d6a9-00000000-17d39bf4 step:13 episode:0 R:0.013000000000000005
INFO:pfrl.experiments.train_agent:statistics:[('average_q', nan), ('average_loss', nan), ('cumulative_steps', 13), ('n_updates', 0), ('rlen', 13)]
INFO:pfrl.experiments.train_agent:outdir:results/a3f5ddb93eeea71a9941a48978350ddbf1b2d6a9-00000000-17d39bf4 step:27 episode:1 R:0.014000000000000005
INFO:pfrl.experiments.train_agent:statistics:[('average_q', nan), ('average_loss', nan), ('cumulative_steps', 27), ('n_updates', 0), ('rlen', 27)]
...
(pfrl) ➜  pfrl git:(eval-during-episode) ✗ cat results/a3f5ddb93eeea71a9941a48978350ddbf1b2d6a9-00000000-17d39bf4/scores.txt
steps   episodes    elapsed mean    median  stdev   max min average_q   average_loss    cumulative_steps    n_updates   rlen
1000    43  2.3906562328338623  9.4 9.0 0.5163977794943222  10.0    9.0 0.11232906  0.01908952370285988 1000    1   1000
2000    86  15.7501220703125    189.7   200.0   32.57145989973431   200.0   97.0    0.5273197   0.0019056166338850744   2000    1001    2000
3000    114 26.208837032318115  76.4    73.5    7.947046970765654   92.0    69.0    0.7134292   0.002025567170785507    3000    2001    3000
4000    143 39.69370198249817   78.6    77.0    9.628660919936433   93.0    65.0    0.8871368   0.0031038593433913774   4000    3001    4000
5000    168 51.10098910331726   80.0    82.5    11.832159566199232  96.0    64.0    1.0027406   0.0035164057277143  5000    4001    5000
6000    188 59.89870023727417   78.1    78.5    4.954235000930461   85.0    72.0    1.1799519   0.0032080489926738665   6000    5001    6000
7000    199 68.57331418991089   198.2   200.0   5.692099788303083   200.0   182.0   1.328854    0.0022577637553331444   7000    6001    7000
8000    210 76.6203100681305    104.7   104.0   5.558776843874918   112.0   97.0    1.4316108   0.002671265401004348    8000    7001    8000
9000    224 84.00722813606262   114.0   114.5   2.581988897471611   117.0   110.0   1.5011693   0.0025710580812301487   9000    8001    9000
10000   237 91.97368311882019   149.9   152.0   13.328749211968674  172.0   133.0   1.4498457   0.002194677170191426    10000   9001    10000