mihahauke / deep_rl_vizdoom

Deep reinforcement learning in ViZDoom (using Tensorflow)

Why does the agent get a mean score of 21 when training but only 6.2 when testing? #7

Open ouyangzhuzhu opened 6 years ago

ouyangzhuzhu commented 6 years ago

Hi~ I trained the agent with A3C. Is there any difference between the training and testing setups? The agent got a mean score of 21 during training but only about 6 when testing. Is there any difference between the evaluation method used during training and the one used by the test script?

Here is the training result at epoch 30:

EPOCH 30
TRAIN: 15360164(GlobalSteps), 110 episodes, mean: 21.682±1.97, min: 11.000, max: 25.000, LocalSpd: 282 STEPS/s GlobalSpd: 524 STEPS/s, 1.89M STEPS/hour, total elapsed time: 8h 8m 44s
TEST: mean: 23.040±1.50, min: 18.000, max: 25.000, test time: 1m 10s
Learning rate: 2.38661577896e-05
Saving model to: models/defend_center/example_a3c_center/ACLstmNet/05.30_14-42/model
Time: 22:52

Here is the testing result for 10 episodes:

d3alg@ubuntu-59:/home/lab/wsy/deep_rl_vizdoom$ CUDA_VISIBLE_DEVICES=4 ./test_a3c.py models/defend_center/example_a3c_center/ACLstmNet/05.30_14-42 -e 10 --seed 123
Mean score: 6.200

Miffyli commented 6 years ago

Judging by this, testing is done with a deterministic policy by default, i.e. instead of sampling an action from the policy's distribution, the agent picks the action with the highest probability.

When it comes to evaluating the performance of an agent during training, people often just print and report the performance during training (the printout you see during training). Reinforcement learning is funny in the sense that your test set is your training set (unless you want to evaluate something specific).
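For illustration, here is a minimal sketch of what a deterministic test-time evaluation loop in ViZDoom could look like. This is not the repo's actual test_a3c.py; the config path, the action set, and the policy_probs placeholder are assumptions.

```python
# Sketch of a deterministic evaluation loop (argmax action selection).
# policy_probs is a hypothetical stand-in for the trained network's output.
import numpy as np
import vizdoom as vzd

def policy_probs(frame):
    """Placeholder for the trained model: returns action probabilities."""
    return np.ones(3) / 3.0  # dummy uniform distribution

game = vzd.DoomGame()
game.load_config("scenarios/defend_the_center.cfg")  # path is an assumption
game.set_window_visible(False)
game.init()

actions = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # example one-hot button combinations
scores = []
for _ in range(10):
    game.new_episode()
    while not game.is_episode_finished():
        frame = game.get_state().screen_buffer
        probs = policy_probs(frame)
        game.make_action(actions[int(np.argmax(probs))])  # deterministic choice
    scores.append(game.get_total_reward())
print("Mean score:", np.mean(scores))
```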

ouyangzhuzhu commented 6 years ago

Thanks for your explanation. You mention that a deterministic policy is used by default when testing; where does the deterministic policy come from? Does it come from the trained model?

Miffyli commented 6 years ago

Yes, the policy comes from the trained model. In this case "the policy" (i.e. our agent) outputs a probability for each action given a state, and these probabilities correspond to what the agent has learned (higher probability = better action).

You can deal with these probabilities in at least two ways:

1) Pick a random action according to the probabilities provided by the policy (higher probability = higher chance of picking that action). This is non-deterministic, since you won't always pick the same action for the same state.
2) Pick the action with the highest probability. This is deterministic, since you always pick the same action for the same state.

The second one might at first sound like the better approach: we want to pick the best action in each state, after all. However, there are publications and toy examples showing that stochasticity can improve performance (or is required for a high reward). In my personal experience, sampling an action with A3C in ViZDoom works better.
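For illustration, a minimal sketch of the two action-selection modes described above (the probs array is a made-up example, not output from the actual model):

```python
# Two ways to turn policy probabilities into an action.
import numpy as np

probs = np.array([0.6, 0.3, 0.1])  # hypothetical action probabilities for one state

# 1) Stochastic: sample an action according to the probabilities.
stochastic_action = np.random.choice(len(probs), p=probs)

# 2) Deterministic: always pick the most probable action.
deterministic_action = int(np.argmax(probs))

print(stochastic_action, deterministic_action)
```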

ouyangzhuzhu commented 6 years ago

Thanks a lot Miffyli for your answers!

ouyangzhuzhu commented 6 years ago

Today I used --hide-window when testing:

d3alg@ubuntu-59:/home/lab/wsy/deep_rl_vizdoom$ CUDA_VISIBLE_DEVICES=4 ./test_a3c.py models/defend_center/example_a3c_center/ACLstmNet/05.30_14-42 -e 10 --hide-window --seed 123
Mean score: 21.45

Why? When training, --hide-window defaults to true, but when testing from the command line it defaults to false, which is why we can see the screen when testing but not when training. But how can this setting cause the score gap? I can't figure it out.
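For reference, window visibility in ViZDoom is a game-level setting and in principle only affects rendering to the screen, not rewards. Below is a hedged sketch of what a --hide-window style flag presumably maps to; the argument handling and config path are assumptions, not the repo's actual code.

```python
# Sketch: what a --hide-window flag typically toggles in ViZDoom.
import argparse
import vizdoom as vzd

parser = argparse.ArgumentParser()
parser.add_argument("--hide-window", action="store_true")
args = parser.parse_args()

game = vzd.DoomGame()
game.load_config("scenarios/defend_the_center.cfg")  # path is an assumption
game.set_window_visible(not args.hide_window)  # the setting the flag controls
game.set_seed(123)
game.init()
```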

mihahauke commented 6 years ago

Perhaps there is some bug in my code or in the new version of ViZDoom. I will investigate it later. Sorry for the late response.