Q-values did in fact decrease. However, the Q statistic (Q_global) shown during training was defined as the highest positive (!) Q-value of the past (~20) steps. Also, no Q-values are calculated during the first iterations of training because the algorithm only "explores" (e = 1.0). As epsilon decreases (e -> 0.1), more steps become "exploitation" steps and therefore more Q-values are calculated. The example code on the GitHub README page will show its first positive Q-values around iteration 2000 (30k steps).
FIX: I pushed a fix. The highest Q-values are now shown independent of their sign (so negative values are shown as well). If no Q-values were calculated (during exploration at the beginning of training), the statistic will be NaN.
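For reference, here is a minimal sketch of how such a statistic behaves (the class and names below are hypothetical, not the repo's exact code): Q-values are only produced on exploitation steps, so the maximum over the last ~20 steps is NaN while the agent is still purely exploring.

```python
import random
import numpy as np

# Hypothetical sketch of the statistic described above: Q-values are only
# produced on exploitation steps, and the reported Q_global is the maximum
# Q-value over the last ~20 steps (NaN if every recent step was exploration).
class QStatTracker:
    def __init__(self, window=20):
        self.window = window
        self.recent_q = []          # max Q-values from recent exploitation steps

    def select_action(self, q_net, state, epsilon, num_actions):
        if random.random() < epsilon:
            return random.randrange(num_actions)   # exploration: no Q-value computed
        q_values = q_net(state)                    # exploitation: forward pass through the network
        self.recent_q.append(float(np.max(q_values)))
        self.recent_q = self.recent_q[-self.window:]
        return int(np.argmax(q_values))

    def q_global(self):
        # NaN while the agent is still purely exploring (epsilon close to 1.0)
        return max(self.recent_q) if self.recent_q else float('nan')
```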
OK, I got it from your thesis. But I still have another problem. In your code I can't see the two networks you mention in your thesis (section 1.5.1, the two Q-networks). What I found is that there is only one neural network in the code and it is updated in every iteration, so I'm confused about why it can still work.
Yes, using both a target network and a Q-network is necessary for a stable learning process (it prevents oscillation, etc.). However, the network can still learn without this extra frozen copy. For better performance, please take a look at the updated version of the DQN code used in this repo. It is mostly the same code, but it incorporates both a target network and a Q-network; a rough sketch of the idea is shown below.
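A minimal sketch of the target-network idea, assuming `q_net` is any callable model that maps a state to a vector of Q-values and exposes a hypothetical `train_on_batch(states, targets)` method (the class name and sync interval are illustrative, not the repo's actual code):

```python
import copy
import numpy as np

# Sketch only: the online Q-network is trained every step, while a frozen copy
# provides the bootstrap targets and is refreshed only every `sync_every` steps.
class TargetNetworkDQN:
    def __init__(self, q_net, gamma=0.95, sync_every=1000):
        self.q_net = q_net                        # online network, updated every step
        self.target_net = copy.deepcopy(q_net)    # frozen copy used for bootstrap targets
        self.gamma = gamma
        self.sync_every = sync_every
        self.step = 0

    def train_step(self, states, actions, rewards, next_states, dones):
        targets = np.array([self.q_net(s) for s in states])        # start from current estimates
        for i, (a, r, s2, d) in enumerate(zip(actions, rewards, next_states, dones)):
            bootstrap = 0.0 if d else self.gamma * np.max(self.target_net(s2))
            targets[i, a] = r + bootstrap                           # target from the *frozen* network
        self.q_net.train_on_batch(np.array(states), targets)       # hypothetical update call
        self.step += 1
        if self.step % self.sync_every == 0:
            self.target_net = copy.deepcopy(self.q_net)            # periodic hard sync
```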
I hope this answers your question.
Thanks a lot. I still have a small question: why not just use
```python
self.reward = state.getScore() - self.lastState.getScore()
```
rather than
```python
if reward > 20:
    self.last_reward = 50.     # Eat ghost   (Yum! Yum!)
elif reward > 0:
    self.last_reward = 10.     # Eat food    (Yum!)
elif reward < -10:
    self.last_reward = -500.   # Get eaten   (Ouch!)
    self.won = False
elif reward < 0:
    self.last_reward = -1.     # Punish time (Pff..)
```
Actually, `self.reward = state.getScore() - self.lastState.getScore()` is the logical choice, as you use the difference in score directly as the reward for the DQN algorithm, and you can definitely do that! Also, by directly using the score you add less domain knowledge and therefore your algorithm is even more general.
The only reason this code exists is that I was experimenting with different reward functions (to evaluate the effect of different reward values on the performance of the agent).
OK, thanks for your detailed answer!
I ran this code and found that the Q-value can't increase. I guess it was because there's no target Q-network with fixed parameters.