Q-values did in fact decrease. However, the Q statistic (Q_global) shown during training was defined as the highest positive (!) Q-value of the past (~20) steps. Also, no Q-values are calculated during the first iterations of training because the algorithm only "explores" (e = 1.0). As epsilon decreases (e -> 0.1), more steps become "exploitation" steps and therefore more Q-values are calculated. The example code on the GitHub README page will show its first positive Q-values around iteration 2000 (30k steps).
FIX: I pushed a fix. The highest Q-values are now shown independent of their sign (so negative values are shown as well). If no Q-values were calculated (during exploration at the beginning of training), the statistic will be NaN.
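For reference, here is a minimal sketch of how such a statistic behaves (the class and names below are hypothetical, not the repo's exact code): Q-values are only produced on exploitation steps, so the maximum over the last ~20 steps is NaN while the agent is still purely exploring.

```python
import random
import numpy as np

# Hypothetical sketch of the statistic described above: Q-values are only
# produced on exploitation steps, and the reported Q_global is the maximum
# Q-value over the last ~20 steps (NaN if every recent step was exploration).
class QStatTracker:
    def __init__(self, window=20):
        self.window = window
        self.recent_q = []          # max Q-values from recent exploitation steps

    def select_action(self, q_net, state, epsilon, num_actions):
        if random.random() < epsilon:
            return random.randrange(num_actions)   # exploration: no Q-value computed
        q_values = q_net(state)                    # exploitation: forward pass through the network
        self.recent_q.append(float(np.max(q_values)))
        self.recent_q = self.recent_q[-self.window:]
        return int(np.argmax(q_values))

    def q_global(self):
        # NaN while the agent is still purely exploring (epsilon close to 1.0)
        return max(self.recent_q) if self.recent_q else float('nan')
```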
OK, I got it from your thesis. But I still have another problem. In your code I can't see the two networks you mention in your thesis (section 1.5.1, the two Q-networks). What I found is that there is only one neural network in the code and it is updated in every iteration, so I'm confused about why it can still work.
Yes, using both a target network and a Q-network is necessary for a stable learning process (it prevents oscillation, etc.). However, the network can still learn without this extra frozen copy. For better performance, please take a look at the updated version of the DQN code used in this repo. It is mostly the same code, but it incorporates both a target network and a Q-network; a rough sketch of the idea is shown below.
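A minimal sketch of the target-network idea, assuming `q_net` is any callable model that maps a state to a vector of Q-values and exposes a hypothetical `train_on_batch(states, targets)` method (the class name and sync interval are illustrative, not the repo's actual code):

```python
import copy
import numpy as np

# Sketch only: the online Q-network is trained every step, while a frozen copy
# provides the bootstrap targets and is refreshed only every `sync_every` steps.
class TargetNetworkDQN:
    def __init__(self, q_net, gamma=0.95, sync_every=1000):
        self.q_net = q_net                        # online network, updated every step
        self.target_net = copy.deepcopy(q_net)    # frozen copy used for bootstrap targets
        self.gamma = gamma
        self.sync_every = sync_every
        self.step = 0

    def train_step(self, states, actions, rewards, next_states, dones):
        targets = np.array([self.q_net(s) for s in states])        # start from current estimates
        for i, (a, r, s2, d) in enumerate(zip(actions, rewards, next_states, dones)):
            bootstrap = 0.0 if d else self.gamma * np.max(self.target_net(s2))
            targets[i, a] = r + bootstrap                           # target from the *frozen* network
        self.q_net.train_on_batch(np.array(states), targets)       # hypothetical update call
        self.step += 1
        if self.step % self.sync_every == 0:
            self.target_net = copy.deepcopy(self.q_net)            # periodic hard sync
```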
I hope this answers your question.
Thanks a lot. I still have a small question: why not just use
```python
self.reward = state.getScore() - self.lastState.getScore()
```
rather than
```python
if reward > 20:
    self.last_reward = 50.     # Eat ghost   (Yum! Yum!)
elif reward > 0:
    self.last_reward = 10.     # Eat food    (Yum!)
elif reward < -10:
    self.last_reward = -500.   # Get eaten   (Ouch!)
    self.won = False
elif reward < 0:
    self.last_reward = -1.     # Punish time (Pff..)
```
Actually, `self.reward = state.getScore() - self.lastState.getScore()` is the logical choice, as you use the difference in score directly as the reward for the DQN algorithm, and you can definitely do that! Also, by directly using the score you add less domain knowledge and therefore your algorithm is even more general.
The only reason this code exists is that I was experimenting with different reward functions (to evaluate the effect of different reward values on the performance of the agent).
OK, thanks for your detailed answer!
I ran this code and found that the Q-value can't increase. I guess it was because there's no target Q-network with fixed parameters.