pemami4911 / deep-rl

Collection of Deep Reinforcement Learning algorithms
MIT License

convergence issue #2

Closed: fangthu closed this issue 7 years ago

fangthu commented 7 years ago

Hi, recently I used your template to learn some simple maneuvers.

But I find the output always converges to -1 or +1 if the number of episodes is large enough and the output is bounded to [-1, 1].

Have you ever run into this situation, or do you know how to solve it?

Best

Anjum48 commented 7 years ago

I'm running into a similar issue. I tried adding a penalty for a certain number of repeated actions, but even after a very large number of episodes the output still converges onto an end member. I'd be interested to see if anyone else has a workaround for this issue. (screenshot: selection_006)

pemami4911 commented 7 years ago

I haven't had a chance to look into this issue, but an initial suggestion is to add a learning rate schedule with tf.train.exponential_decay for both networks. Also, set a target loss or average reward and stop training the networks once you hit it, rather than continuing to update the weights. If the networks have learned well enough before you have trained them for X episodes, stopping early is recommended to prevent the weights from sliding into a worse local minimum.
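For reference, here is a minimal sketch of what such a schedule could look like with the TF1 API (the initial rates and decay settings below are illustrative placeholders, not values from this repo):

import tensorflow as tf

# Sketch only: hyperparameter values here are placeholders to tune per task.
global_step = tf.Variable(0, trainable=False, name="global_step")

actor_lr = tf.train.exponential_decay(
    learning_rate=1e-4,    # initial actor learning rate
    global_step=global_step,
    decay_steps=10000,     # decay every 10k training steps
    decay_rate=0.96,
    staircase=True)
critic_lr = tf.train.exponential_decay(1e-3, global_step, 10000, 0.96, staircase=True)

actor_optimizer = tf.train.AdamOptimizer(actor_lr)
critic_optimizer = tf.train.AdamOptimizer(critic_lr)

# Pass global_step to one minimize() call so the step counter advances each update, e.g.:
# critic_train_op = critic_optimizer.minimize(critic_loss, global_step=global_step)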

Anjum48 commented 7 years ago

I think with the Adam optimiser you don't need learning rate decay. I did email David Silver about this, and he said it's usually possible to solve pendulum with bang-bang control, so if it's stabilising and achieving the desired reward, maybe converging to -1 or +1 is okay.

pemami4911 commented 7 years ago

Ah, right. Yeah, this implementation is pretty simple, so it works for a task like pendulum. More tricks and tuning would definitely be needed for a more complex problem.

GoingMyWay commented 6 years ago

@Anjum48

Hi, may I ask you a question: how can I plot a chart like the one you posted with TensorBoard?

Anjum48 commented 6 years ago

Hi @GoingMyWay, it's a bit tricky, but you can create a histogram for TensorBoard from a NumPy array. I used a custom function which does this (a bit hacky, but I haven't found a better way yet): https://github.com/Anjum48/rl-examples/blob/master/dppg/ddpg.py#L204
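For the general idea, here is a rough sketch (not the exact code at that link) of writing a NumPy array to TensorBoard as a histogram via the TF1 summary protos; the function name log_histogram and its arguments are just illustrative:

import numpy as np
import tensorflow as tf

def log_histogram(writer, tag, values, step, bins=1000):
    # Build a HistogramProto from a NumPy array and write it with a FileWriter.
    values = np.asarray(values)
    counts, bin_edges = np.histogram(values, bins=bins)

    hist = tf.HistogramProto()
    hist.min = float(np.min(values))
    hist.max = float(np.max(values))
    hist.num = int(np.prod(values.shape))
    hist.sum = float(np.sum(values))
    hist.sum_squares = float(np.sum(values ** 2))

    # The proto stores the right edge of each bucket, so drop the leftmost edge.
    for edge in bin_edges[1:]:
        hist.bucket_limit.append(edge)
    for c in counts:
        hist.bucket.append(int(c))

    summary = tf.Summary(value=[tf.Summary.Value(tag=tag, histo=hist)])
    writer.add_summary(summary, step)
    writer.flush()

# Usage sketch:
# writer = tf.summary.FileWriter('./logs')
# log_histogram(writer, 'episode_rewards', rewards_array, episode)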

Hope this helps!

GoingMyWay commented 6 years ago

@Anjum48

Thank you. BTW, how many episodes does it take to train Pendulum-v0? I trained it for 10k episodes, but it still doesn't converge.

Anjum48 commented 6 years ago

I found that in my implementation of DDPG (which is pretty similar to how @pemami4911 did it), it converges after 100-200 episodes (FYI, I can't get it to learn this fast with other algorithms, e.g. A3C or PPO).

image

In my experience, DDPG is very sensitive to how the OU noise is added to the actions, so I added an exponential decay like this:

# Decay the exploration scale exponentially with the episode index i
epsilon = np.exp(-i / TAU2)
# Scale the OU noise by epsilon and normalise by the action range before adding it to the action
a += epsilon * exploration_noise.noise() / env.action_space.high

with TAU2 = 25 (this should be dependent on the environment). An interesting area of research which I still need to try is adding noise to the network parameters rather than the actions (see https://github.com/openai/baselines/tree/master/baselines/ddpg)
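For completeness, here is a minimal sketch of the kind of exploration_noise object assumed in the snippet above, i.e. an Ornstein-Uhlenbeck process (the class name and the theta/sigma defaults are just illustrative):

import numpy as np

class OrnsteinUhlenbeckNoise:
    # Minimal OU process; noise() matches the exploration_noise.noise() call above.
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta
        self.sigma = sigma
        self.state = np.copy(self.mu)

    def reset(self):
        # Optionally call at the start of each episode.
        self.state = np.copy(self.mu)

    def noise(self):
        # x <- x + theta * (mu - x) + sigma * N(0, 1), a discretised OU step
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state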

GoingMyWay commented 6 years ago

@Anjum48

Thank you, I will run your code. The results from pemami4911's code are:

image

From the curves of average max Q and reward, I can't tell whether it converges or not.

Anjum48 commented 6 years ago

@GoingMyWay I suspect that it is converging, but because the noise term is still being added to the actions (i.e. it hasn't decayed to zero after learning), the actions are too noisy to produce a smooth-looking reward curve. For example, the pendulum might be nicely balanced in the upright position, but the random noise added to the actions will knock it off balance, hence the poor scores.
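One way to check this would be to run occasional evaluation episodes with the exploration noise switched off and plot those returns separately; a rough sketch, assuming a Gym-style env and an actor with a predict() method (both names are placeholders, not necessarily matching either implementation):

import numpy as np

def evaluate(env, actor, episodes=10, max_steps=200):
    # Noise-free rollouts of the deterministic policy; returns the average episode return.
    returns = []
    for _ in range(episodes):
        s = env.reset()
        total = 0.0
        for _ in range(max_steps):
            a = actor.predict(np.reshape(s, (1, -1)))  # no exploration noise added here
            s, r, done, _ = env.step(a[0])
            total += r
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))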