rmst / ddpg

TensorFlow implementation of the DDPG algorithm from the paper Continuous Control with Deep Reinforcement Learning (ICLR 2016)
MIT License

Well done! But is it working? #1

Closed lsqshr closed 8 years ago

lsqshr commented 8 years ago

Hi,

I was looking for such a repo to understand how to implement ddpg. Thanks for sharing.

I tried Reacher-v1, but it does not seem to converge. So is this repo currently working, or is it still under construction?

Also, have you considered using Keras to make things cleaner?

Cheers!

rmst commented 8 years ago

Hi, thanks!

Yes, there is a bug in ddpg. I'm currently investigating. Another problem with the mujoco envs is that they are not normalized (e.g. in Reacher the dimensions representing the velocities have a 20x higher variance than the other dimensions). Batch normalization would alleviate this but it's not implemented yet either. So the repo is still under construction but I'm super happy to get feedback!

I haven't worked with Keras yet but when I looked into the docs it didn't seem obvious to me how to optimize the policy parameters with respect to the Q-network. In TF this is pretty straightforward because of automatic differentiation.
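As an aside on the normalization issue mentioned above (some observation dimensions having a much larger spread than others): one common workaround, short of batch normalization, is to standardize each observation dimension with running statistics. The sketch below is not code from this repo; the class name `RunningNormalizer` and the toy samples are assumptions made purely for illustration.

```python
import math


class RunningNormalizer:
    """Per-dimension running mean/variance using Welford's algorithm.

    Hypothetical helper for illustration, not part of the ddpg repo.
    """

    def __init__(self, dim, eps=1e-8):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations
        self.eps = eps

    def update(self, obs):
        # Fold one observation into the running statistics.
        self.n += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        # Until two samples are seen the variance is undefined; pass through.
        if self.n < 2:
            return list(obs)
        out = []
        for i, x in enumerate(obs):
            std = math.sqrt(self.m2[i] / (self.n - 1)) + self.eps
            out.append((x - self.mean[i]) / std)
        return out


# Toy data (made up): the second dimension has a much larger spread.
norm = RunningNormalizer(dim=2)
samples = [(0.1, 2.0), (-0.1, -2.0), (0.2, 4.0), (-0.2, -4.0)]
for s in samples:
    norm.update(s)
scaled = [norm.normalize(s) for s in samples]
```

After normalization both dimensions end up on the same scale, so neither dominates the network input. In practice the statistics would be updated from observations as the agent collects them.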

lsqshr commented 8 years ago

It's good to know what the potential problem is. If it is batch normalisation, then you should definitely try Keras, where it takes one line. I made an example with Keras for a discrete vanilla Q-network that might give you some hints. Looking forward to the working version.

rmst commented 8 years ago

Yes, the DQN algorithm is probably easy to implement in Keras because you only have one NN. But in DDPG you also have a NN for the policy which is trained in a nonstandard way (via policy gradients). How would you implement that in Keras?
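To make the "nonstandard" actor update concrete, here is a minimal sketch of the idea, written without any deep learning library: the policy weight is moved along dQ/dw = dQ/da * da/dw, i.e. the critic's gradient with respect to the action is pushed back through the policy. Everything in it is assumed for illustration (a scalar linear policy, a fixed hand-written critic, the learning rate); in a framework with automatic differentiation the chain rule below is what the framework computes for you.

```python
# Toy setup (assumed): scalar state s, linear policy a = w * s.
# The critic is fixed and known: Q(s, a) = -(a - 2*s)**2,
# so the best action is a = 2*s and the best policy weight is w = 2.


def policy(w, s):
    return w * s


def dq_da(s, a):
    # Derivative of Q(s, a) = -(a - 2*s)**2 with respect to the action a.
    return -2.0 * (a - 2.0 * s)


def actor_gradient(w, states):
    # Chain rule: dQ/dw = dQ/da * da/dw, where da/dw = s for this policy.
    # The gradient is averaged over the batch of states.
    total = 0.0
    for s in states:
        a = policy(w, s)
        total += dq_da(s, a) * s
    return total / len(states)


w = 0.0
states = [-1.0, -0.5, 0.5, 1.0]
for _ in range(200):
    w += 0.1 * actor_gradient(w, states)  # gradient ascent on Q
```

Here `w` converges towards 2, the weight that maximizes the toy critic. In DDPG proper the critic is itself a learned network, so the actor step needs a way to differentiate the critic's output with respect to its action input, which is the part that was not obvious in Keras at the time.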

lsqshr commented 8 years ago

You got me dude. I'm looking into it.

On Jun 16, 2016, at 7:32 PM, Simon Ramstedt notifications@github.com wrote:

Yes, the DQN algorithm is probably easy to implement in Keras because you only have one NN. But in DDPG you also have a NN for the policy which is trained in a nonstandard way (via policy gradients). How would you implement that in Keras?


rmst commented 8 years ago

Hey, just wanted you to know that ddpg is now converging on Reacher-v1. The main problem was the reward/return scaling. In order for ddpg to work, the returns have to have a certain magnitude. That is simply a problem of the algorithm. However, DeepMind just released a new paper (PopArt) that addresses this issue. Any news regarding Keras?
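For readers wondering what reward/return scaling looks like in practice, a minimal sketch: multiplying every reward by a constant multiplies the discounted return, and hence the critic's regression targets, by the same constant. The reward values, the discount factor and the scale of 100 below are hypothetical; the thread does not say which scale was actually used in the repo.

```python
def discounted_return(rewards, gamma=0.99):
    # Sum of gamma**t * r_t, accumulated from the last step backwards.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def scale_rewards(rewards, scale):
    # Constant reward scaling; `scale` is a hand-tuned hyperparameter.
    return [scale * r for r in rewards]


raw = [-0.01] * 50                  # hypothetical small per-step rewards
scaled = scale_rewards(raw, 100.0)  # hypothetical scale factor
g_raw = discounted_return(raw)
g_scaled = discounted_return(scaled)
```

Because the return is linear in the rewards, `g_scaled` is 100 times `g_raw`. The ranking of policies is unchanged; only the magnitude of the targets the Q-network has to fit is different.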

lsqshr commented 8 years ago

It's great you made it work, dude. I think J. Schulman used Keras in his modular_rl, though with the Theano backend (https://github.com/joschu/modular_rl). There is also a working version of ddpg in rllab (https://github.com/rllab/rllab); they used a similar NN wrapper called Lasagne. They may be worth a look for improvements.

Best!

On 1 July 2016 at 08:08, Simon Ramstedt notifications@github.com wrote:

Hey, just wanted you to know that ddpg is now converging on Reacher-v1 (https://gym.openai.com/evaluations/eval_jMAmHzFQQnSeclUQ55mU5Q). The main problem was the reward/return scaling. In order for ddpg to work, the returns have to have a certain magnitude. That is simply a problem of the algorithm. However, DeepMind just released a new paper (PopArt, http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/popart.pdf) that addresses this issue. Any news regarding Keras?


SIQI LIU / PhD Candidate in University of Sydney +61(0)435835978/ sliu4512


rmst commented 8 years ago

Thanks

rmst commented 8 years ago

Update: keras-rl might be interesting for you

lsqshr commented 8 years ago

Wonderful job mate!

On 18 August 2016 at 00:55, Simon Ramstedt notifications@github.com wrote:

Update: keras-rl (https://github.com/matthiasplappert/keras-rl) might be interesting for you

