Open stas-sl opened 8 years ago
Hi @stas-sl, sorry for the late reply; I'm just getting back into DDPG after a holiday break.
I think you are right on all counts! It is very helpful to have a second pair of eyes check my code. You've found quite a few genuine bugs, I think. I am embarrassed but very grateful :)
If you feel so motivated, I would happily accept a PR on this project. I haven't been focusing on maintaining the DPG implementation much while I work on PointerNetDPG. If you don't have the time, I'll try to get to fixing and re-testing things eventually.
> Did you achieve good results, BTW?
I was not able to achieve good results with the Theano implementation, which is part of the reason why I moved to a TF implementation. In TF I got moderately good* CartPole results after a long manual hyperparameter search, and at that point I dropped everything and proceeded to the PointerNetDPG (the main thrust of this GitHub project). It is likely that my search would have been less painful if I had found the bugs you've pointed out!
> Another detail, I remembered, is that they use Ornstein-Uhlenbeck process for exploration and batch normalisation. But that may be not as critical for simple problems.
Right. I avoided OU exploration for this simple task, just because I didn't want unnecessary complexity in my proof-of-concept setting. I have tried batch normalization within the PointerNetDPG (BN of action representations), but it hasn't helped much. There's lots of room for exploration here (no pun intended).
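For what it's worth, OU noise is easy to drop in later if needed. A minimal sketch (not code from this repo; the parameter values are the ones I recall the paper using, theta = 0.15, sigma = 0.2):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta
        self.sigma = sigma
        self.reset()

    def reset(self):
        # Restart each episode from the long-run mean.
        self.state = self.mu.copy()

    def sample(self):
        # x <- x + theta * (mu - x) + sigma * N(0, I)   (Euler step with dt = 1)
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state
```

Exploration would then be `action = actor(state) + noise.sample()`, with `noise.reset()` at the start of each episode.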
Hi! I'm trying to implement DDPG as well, based on the paper "Continuous control with deep reinforcement learning", though without much success yet... So I was looking at others' implementations and found your repo :)
Looking through your code (this repo and ipython-notebooks) I found some differences from my understanding and implementation of the paper. Most likely your code is the more correct one, since my implementation is not working very well, but I'd like to clarify several points.
Let's start with this repo and the TensorFlow implementation.
rlcomp/dpg.py: __makegraph
If I understand correctly, the "track" term here is just a synonym for "target" in the paper, and the tracking models are just copies of the current models whose weights are updated with some delay. These models are used to calculate the new Q values (`q_targets` in your code). So my understanding is that `critic_on_track` should depend on `a_pred_track` rather than on `a_pred`, because in the target formula y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) both Q' and mu' are the target versions of Q and mu.
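In other words, roughly this (a toy sketch with made-up linear stand-ins, just to show which networks I think `q_targets` should be built from):

```python
import numpy as np

# Toy stand-ins for the tracking (target) networks; shapes and weights are made up.
def target_actor(states):             # mu'(s)
    return np.tanh(states @ np.ones((4, 1)))

def target_critic(states, actions):   # Q'(s, a)
    return np.concatenate([states, actions], axis=1) @ np.ones((5, 1))

gamma = 0.99
rewards = np.random.randn(32, 1)
next_states = np.random.randn(32, 4)

# y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})): both the action and the Q value
# come from the target (tracking) copies.
q_targets = rewards + gamma * target_critic(next_states, target_actor(next_states))
```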
rlcomp/dpg.py: __makeobjectives

Comparing to the paper: as seen in the formula for the critic loss, L = 1/N * sum_i (y_i - Q(s_i, a_i | theta^Q))^2, they use Q(s_i, a_i) with the actions a_i sampled from the replay buffer, which is not the same as `critic_off` in your implementation, because `critic_off` doesn't use those sampled actions but instead depends on `a_explore`, which generates new actions each time: a_i = a_explore = mu(s_i) + noise.
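Concretely, the difference I mean, again with toy stand-ins (`critic` and `actor` here are made-up placeholders for the current networks, not names from the repo):

```python
import numpy as np

def critic(states, actions):   # toy stand-in for Q(s, a)
    return np.concatenate([states, actions], axis=1) @ np.ones((5, 1))

def actor(states):             # toy stand-in for mu(s)
    return np.tanh(states @ np.ones((4, 1)))

states = np.random.randn(32, 4)
b_actions = np.random.randn(32, 1)   # a_i sampled from the replay buffer
q_targets = np.random.randn(32, 1)   # y_i, as computed from the target networks

# Paper: L = 1/N * sum_i (y_i - Q(s_i, a_i))^2, with a_i taken from the buffer.
loss_paper = np.mean((q_targets - critic(states, b_actions)) ** 2)

# What critic_off effectively does: it re-generates actions with the current policy + noise.
noise = 0.1 * np.random.randn(32, 1)
loss_critic_off = np.mean((q_targets - critic(states, actor(states) + noise)) ** 2)
```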
rlcomp/tasks/cartpole.py: _trainbatch
This piece of code I just included to be able to see everything in one place. You can see that `b_actions` are not used. It looks suspicious to me... Here is their algorithm for convenience:
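Paraphrasing the inner loop of Algorithm 1 in my own words (plain-text math, so it may not match the paper's notation exactly):

- select action a_t = mu(s_t | theta_mu) + N_t (exploration noise), execute it, observe r_t and s_{t+1}, and store (s_t, a_t, r_t, s_{t+1}) in the replay buffer R;
- sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R;
- set y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1} | theta_mu') | theta_Q');
- update the critic by minimizing L = 1/N * sum_i (y_i - Q(s_i, a_i | theta_Q))^2;
- update the actor with the sampled policy gradient: grad_{theta_mu} J ~= 1/N * sum_i grad_a Q(s, a | theta_Q)|_{s=s_i, a=mu(s_i)} * grad_{theta_mu} mu(s | theta_mu)|_{s=s_i};
- soft-update the targets: theta_Q' <- tau * theta_Q + (1 - tau) * theta_Q', and likewise for theta_mu'.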
Then I found your Theano-based implementation :) Let's look there.
DPG for CartPole.ipynb: _trainbatch
As in the previous implementation, `b_actions` are not used. But unlike it, here you use both tracking versions of the actor and critic models for predicting the new Q values, which seems correct.

DPG for CartPole.ipynb: __makeupdates
Still the same question: why is `critic_exp` used in the cost calculation? I assume `critic_given` should be used, with the actions sampled from the replay buffer. Please correct me if I'm wrong.

Another issue that I see is in updating the target network. As I understand it, in this case `self.parent` is the current model, from which weights should gradually be copied to the target, which is `self` here. So if `self.tau` is small, like 0.001, then you should take 0.999 of the previous target value plus 0.001 of the new value: `(1 - self.tau) * vs.vars[param_name] + self.tau * parent_vs.vars[param_name]`
That's all my findings. I'm quite new to TF, Theano, and RL, so they all could be wrong :) But I would really like to hear your comments on them. As I said, my implementation doesn't converge to anything meaningful, so I'm wondering whether this is due to misinterpreting the paper or whether I just need to tune some hyperparameters. Did you achieve good results, BTW?