Open stas-sl opened 8 years ago
Hi @stas-sl, sorry for the late reply; I'm just getting back into DDPG after a holiday break.
I think you are right on all counts! It is very helpful to have a second pair of eyes check my code. You've found quite a few genuine bugs, I think. I am embarrassed but very grateful :)
If you feel so motivated, I would happily accept a PR on this project. I haven't been focusing on maintaining the DPG implementation much while I work on PointerNetDPG. If you don't have the time, I'll try to get to fixing and re-testing things eventually.
> Did you achieve good results, BTW?
I was not able to achieve good results with the Theano implementation, which is part of the reason why I moved to a TF implementation. In TF I got moderately good* CartPole results after a long manual hyperparameter search, and at that point I dropped everything and proceeded to the PointerNetDPG (the main thrust of this GitHub project). It is likely that my search would have been less painful if I had found the bugs you've pointed out!
> Another detail, I remembered, is that they use Ornstein-Uhlenbeck process for exploration and batch normalisation. But that may be not as critical for simple problems.
Right. I avoided OU exploration for this simple task, just because I didn't want unnecessary complexity in my proof-of-concept setting. I have tried batch normalization within the PointerNetDPG (BN of action representations), but it hasn't helped much. There's lots of room for exploration here (no pun intended).
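For what it's worth, OU noise is easy to drop in later if needed. A minimal sketch (not code from this repo; the parameter values are the ones I recall the paper using, theta = 0.15, sigma = 0.2):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta
        self.sigma = sigma
        self.reset()

    def reset(self):
        # Restart each episode from the long-run mean.
        self.state = self.mu.copy()

    def sample(self):
        # x <- x + theta * (mu - x) + sigma * N(0, I)   (Euler step with dt = 1)
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state
```

Exploration would then be `action = actor(state) + noise.sample()`, with `noise.reset()` at the start of each episode.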
Hi! I'm trying to implement DDPG as well, based on the paper "Continuous control with deep reinforcement learning", though without much success yet... So I was looking at others' implementations and found your repo :)
Looking through your code (this repo and ipython-notebooks) I found some differences from my understanding and implementation of the paper. Most likely your code is the more correct one, since my implementation is not working very well, but I'd like to clarify several points.
Let's start with this repo and the TensorFlow implementation.
rlcomp/dpg.py: __makegraph
If I understand correctly, the "track" term here is just a synonym for "target" in the paper, and the tracking models are just copies of the current models whose weights are updated with some delay. These models are used to calculate the new Q values (`q_targets` in your code). So my understanding is that `critic_on_track` should depend on `a_pred_track` rather than on `a_pred`, because in the target formula y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) both Q' and mu' are the target versions of Q and mu.
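In other words, roughly this (a toy sketch with made-up linear stand-ins, just to show which networks I think `q_targets` should be built from):

```python
import numpy as np

# Toy stand-ins for the tracking (target) networks; shapes and weights are made up.
def target_actor(states):             # mu'(s)
    return np.tanh(states @ np.ones((4, 1)))

def target_critic(states, actions):   # Q'(s, a)
    return np.concatenate([states, actions], axis=1) @ np.ones((5, 1))

gamma = 0.99
rewards = np.random.randn(32, 1)
next_states = np.random.randn(32, 4)

# y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})): both the action and the Q value
# come from the target (tracking) copies.
q_targets = rewards + gamma * target_critic(next_states, target_actor(next_states))
```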
rlcomp/dpg.py: __makeobjectives

Comparing to the paper: as seen in the formula for the critic loss, L = 1/N * sum_i (y_i - Q(s_i, a_i | theta^Q))^2, they use Q(s_i, a_i) with the actions a_i sampled from the replay buffer, which is not the same as `critic_off` in your implementation, because `critic_off` doesn't use those sampled actions but instead depends on `a_explore`, which generates new actions each time: a_i = a_explore = mu(s_i) + noise.
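Concretely, the difference I mean, again with toy stand-ins (`critic` and `actor` here are made-up placeholders for the current networks, not names from the repo):

```python
import numpy as np

def critic(states, actions):   # toy stand-in for Q(s, a)
    return np.concatenate([states, actions], axis=1) @ np.ones((5, 1))

def actor(states):             # toy stand-in for mu(s)
    return np.tanh(states @ np.ones((4, 1)))

states = np.random.randn(32, 4)
b_actions = np.random.randn(32, 1)   # a_i sampled from the replay buffer
q_targets = np.random.randn(32, 1)   # y_i, as computed from the target networks

# Paper: L = 1/N * sum_i (y_i - Q(s_i, a_i))^2, with a_i taken from the buffer.
loss_paper = np.mean((q_targets - critic(states, b_actions)) ** 2)

# What critic_off effectively does: it re-generates actions with the current policy + noise.
noise = 0.1 * np.random.randn(32, 1)
loss_critic_off = np.mean((q_targets - critic(states, actor(states) + noise)) ** 2)
```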
rlcomp/tasks/cartpole.py: _trainbatch
This piece of code I just included to be able to see everything in one place. You can see that `b_actions` are not used. It looks suspicious to me... Here is their algorithm for convenience:
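Paraphrasing the inner loop of Algorithm 1 in my own words (plain-text math, so it may not match the paper's notation exactly):

- select action a_t = mu(s_t | theta_mu) + N_t (exploration noise), execute it, observe r_t and s_{t+1}, and store (s_t, a_t, r_t, s_{t+1}) in the replay buffer R;
- sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R;
- set y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1} | theta_mu') | theta_Q');
- update the critic by minimizing L = 1/N * sum_i (y_i - Q(s_i, a_i | theta_Q))^2;
- update the actor with the sampled policy gradient: grad_{theta_mu} J ~= 1/N * sum_i grad_a Q(s, a | theta_Q)|_{s=s_i, a=mu(s_i)} * grad_{theta_mu} mu(s | theta_mu)|_{s=s_i};
- soft-update the targets: theta_Q' <- tau * theta_Q + (1 - tau) * theta_Q', and likewise for theta_mu'.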
Then I found your Theano-based implementation :) Let's look there.
DPG for CartPole.ipynb: _trainbatch
As in the previous implementation, `b_actions` are not used. But unlike it, here you use both tracking versions of the actor and critic models for predicting the new Q values, which seems correct.

DPG for CartPole.ipynb: __makeupdates
Still the same question: why is `critic_exp` used in the cost calculation? I assume `critic_given` should be used, with the actions sampled from the replay buffer. Please correct me if I'm wrong.

Another issue that I see is in updating the target network. As I understand it, in this case `self.parent` is the current model, from which weights should gradually be copied to the target, which is `self` here. So if `self.tau` is small, like 0.001, then you should take 0.999 of the previous target value plus 0.001 of the new value: `(1 - self.tau) * vs.vars[param_name] + self.tau * parent_vs.vars[param_name]`
That's all my findings. I'm quite new to TF, Theano, and RL, so they all could be wrong :) But I would really like to hear your comments on them. As I said, my implementation doesn't converge to anything meaningful, so I'm wondering whether this is due to misinterpreting the paper or whether I just need to tune some hyperparameters. Did you achieve good results, BTW?