ouyangzhuzhu opened 6 years ago
Judging by this, testing is done with a deterministic policy by default. I.e. instead of sampling an action from the policy's distribution, the agent picks the action with the highest probability.
When it comes to evaluating the performance of an agent during training, people often just print and report the performance during training (the printout you see while training). Reinforcement learning is funny in the sense that your test set is your training set (unless you want to evaluate something specific).
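For illustration, a minimal evaluation loop could look like the sketch below. This is not code from this repo; `env` is assumed to be a gym-style environment and `policy_probs` a hypothetical function returning the trained network's action probabilities for a state:

```python
import numpy as np

def evaluate(env, policy_probs, episodes=10):
    """Roll out the trained policy for a few episodes and return the mean score."""
    scores = []
    for _ in range(episodes):
        state = env.reset()
        done, total_reward = False, 0.0
        while not done:
            probs = policy_probs(state)     # action distribution from the trained model
            action = int(np.argmax(probs))  # deterministic: take the most probable action
            state, reward, done, _ = env.step(action)
            total_reward += reward
        scores.append(total_reward)
    return np.mean(scores)
```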
Thanks for your explanation! You mention that testing uses a deterministic policy by default; where does that deterministic policy come from? Does it come from the trained model?
Yes, the policy comes from the trained model. In this case "the policy" (i.e. our agent) outputs a probability for each action given a state, and these probabilities correspond to what the agent has learned (higher probability = better action).
You can deal with these probabilities in at least two ways: 1) Pick a random action according to the probabilities provided by the policy (higher probability = higher chance of picking that action). This is non-deterministic, since you won't always pick the same action for the same state. Or, 2) Pick the action with the highest probability. This is deterministic, since you always pick the same action for the same state.
The second one might at first sound like the better approach: we want to pick the best action in each state, after all. However, there are publications and toy examples showing that stochasticity can improve performance (or is even required for high reward). In my personal experience, sampling actions with A3C in ViZDoom works better.
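To make the two options concrete, here is a small sketch using plain numpy (assuming `probs` is the softmax output of the policy network; not code from this repo):

```python
import numpy as np

probs = np.array([0.1, 0.6, 0.3])  # example action probabilities from the policy

# 1) Stochastic: sample an action according to the distribution.
#    The same state can yield different actions on different calls.
stochastic_action = np.random.choice(len(probs), p=probs)

# 2) Deterministic: always take the most probable action.
#    The same state always yields the same action.
deterministic_action = int(np.argmax(probs))
```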
Thanks a lot, Miffyli, for your answers!
Today I tested with --hide-window:
d3alg@ubuntu-59:/home/lab/wsy/deep_rl_vizdoom$ CUDA_VISIBLE_DEVICES=4 ./test_a3c.py models/defend_center/example_a3c_center/ACLstmNet/05.30_14-42 -e 10 --hide-window --seed 123
Mean score: 21.45
Why? During training --hide-window defaults to true, but when testing from the command line it defaults to false. That's why we can see the screen when testing but not when training. But how can this setting cause the score gap? I can't figure it out.
Perhaps there is some bug in my code or in the new version of ViZDoom. I will investigate it later. Sorry for the late response.
Hi~ I trained the agent with A3C. Are there any differences between the training and testing sets? The agent got a mean score of 21 during training but only 6 when testing. Is there any difference between the evaluation method used during training and the one used in the test script?
Here is the training result at epoch 30:
EPOCH 30
TRAIN: 15360164 (GlobalSteps), 110 episodes, mean: 21.682±1.97, min: 11.000, max: 25.000, LocalSpd: 282 STEPS/s, GlobalSpd: 524 STEPS/s, 1.89M STEPS/hour, total elapsed time: 8h 8m 44s
TEST: mean: 23.040±1.50, min: 18.000, max: 25.000, test time: 1m 10s
Learning rate: 2.38661577896e-05
Saving model to: models/defend_center/example_a3c_center/ACLstmNet/05.30_14-42/model
Time: 22:52
Here is the testing result for 10 episodes:
d3alg@ubuntu-59:/home/lab/wsy/deep_rl_vizdoom$ CUDA_VISIBLE_DEVICES=4 ./test_a3c.py models/defend_center/example_a3c_center/ACLstmNet/05.30_14-42 -e 10 --seed 123
Mean score: 6.200