rlcode / reinforcement-learning

Minimal and Clean Reinforcement Learning Examples
MIT License
3.37k stars · 728 forks

Pong Policy Gradient: important error in the definition of the convolutional net #79

Open TomaszRem opened 6 years ago

TomaszRem commented 6 years ago

I tried to run Pong Policy Gradient for 2000 episodes on the original file with no results whatsoever. I then boosted the reward for positive points (points scored by the learner, right side) to 20 and got the result in the attached plot (pong_reinforce_v1 02x20x-1). Boosting the learner's point reward to 100 gave a slight improvement after around 1500 episodes, similar to the one in the picture; I ran it to 8100 episodes and there was no further improvement except a slightly smaller variance. Forgive my being naive, but after successfully running three versions of CartPole I was expecting some logical results. As you can see from the picture, the variance is large, and after an improvement around episodes 800-900 the results seem stagnant. Has anybody run it for more episodes, tweaked the rewards, and brought the results up and the variance down? Given the policy, should I also boost the penalty for points scored by the opponent (left side)? Any guidance will be appreciated. Thanks.
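For reference, the kind of reward-boosting experiment described above can be sketched as a small preprocessing step on an episode's reward list before computing discounted returns. This is a minimal illustration, not the repository's code: the function name, the `positive_scale` parameter, and the rally-boundary reset are my assumptions about a typical Pong setup.

```python
import numpy as np

def discount_and_scale(rewards, gamma=0.99, positive_scale=20.0):
    """Scale positive (learner-scored) rewards by `positive_scale`,
    then compute discounted returns, resetting the running sum at each
    non-zero reward (a point ends a rally in Pong)."""
    scaled = [r * positive_scale if r > 0 else r for r in rewards]
    returns = np.zeros(len(scaled))
    running = 0.0
    for t in reversed(range(len(scaled))):
        if scaled[t] != 0:
            running = 0.0  # rally boundary: don't propagate across points
        running = running * gamma + scaled[t]
        returns[t] = running
    return returns

# Example: a short rally the learner wins (+1 is scaled to +20)
print(discount_and_scale([0, 0, 1], gamma=0.5))  # [ 5. 10. 20.]
```

Scaling only the positive rewards changes the relative weight of winning versus losing a point, which is why the asymmetric boosts (20x, 100x) shift where the learning curve plateaus.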

TomaszRem commented 6 years ago

I found the reason behind my issue: the convolutional part of the neural net was wrongly defined, which is why it converged to a negative result. Based on my earlier experience with convolutional networks I made two changes. First, I changed `model.add(Reshape((1, 80, 80), input_shape=(self.state_size,)))` to `model.add(Reshape((80, 80, 1), name="Layer1", input_shape=(self.state_size,)))`, which reshapes the input correctly into 80-by-80 windows instead of 1-by-80 windows. Second, I removed the `strides=(3, 3)` argument and fell back to the default stride; without this change the network was losing information, converging early, and not exploring any more. The resulting network is shown in the attached diagram (net_pong_reinforce_v1). After only 1000 episodes it mostly wins, although with high variance and a bias to stay in the lower part of the screen (plot: pong_reinforce_v1 02 01to1050x1x-1); it either needs more training or a redefinition of the act function. I also made some more changes to the structure to speed things up, because with 1.8 million weights it was very slow on my laptop.
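The effect of the two `Reshape` targets can be seen without Keras at all. With the channels-last convention Keras uses by default, `(1, 80, 80)` means a 1-pixel-tall, 80-pixel-wide "image" with 80 channels, so the 2-D spatial layout of the frame is destroyed and a square kernel cannot slide vertically; `(80, 80, 1)` keeps the real 80x80 geometry with one channel. A small NumPy sketch (the frame contents are dummy values for illustration):

```python
import numpy as np

# A flattened 80x80 frame, as produced by the Pong preprocessing step
frame = np.arange(80 * 80).reshape(80, 80)
flat = frame.ravel()                 # shape (6400,), what the net receives

wrong = flat.reshape(1, 80, 80)      # H=1,  W=80, C=80 (channels-last)
right = flat.reshape(80, 80, 1)      # H=80, W=80, C=1

# "right" preserves 2-D adjacency: the pixel one row below (0, 0)
# in the original frame is one row below in the tensor as well.
assert right[1, 0, 0] == frame[1, 0]

# "wrong" has a spatial extent of only 1x80: what were image rows are
# mixed into 80 channels, so a 3x3 kernel has no vertical structure to see.
assert wrong.shape[0] == 1
print(right.shape, wrong.shape)      # (80, 80, 1) (1, 80, 80)
```

This is why the original definition could still train (the layer shapes are valid) yet converge to a poor policy: the convolution was never looking at spatially coherent patches of the screen.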