openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms
MIT License

Confused about HER+DDPG policy loss #492

Closed: astier closed this issue 6 years ago

astier commented 6 years ago

The policy loss in the HER+DDPG implementation is defined as follows:

self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)
self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))

This can be found here: https://github.com/openai/baselines/blob/f2729693253c0ef4d4086231d36e0a4307ec1cb3/baselines/her/ddpg.py#L274

I understand why we are using the first part:

self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)

However, I do not understand the purpose of the second part (which I will call mean_sqr_action from now on):

self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))

As far as I know, this second part is never mentioned in the paper, and it also isn't used in your vanilla baselines DDPG implementation. Additionally, in my two experiments, removing mean_sqr_action from the loss function improved learning significantly.

The first experiment was in the FetchReach-v1 environment with the default settings. In this case, the mean_sqr_action version needs 5 epochs to reach a success rate of 1, whereas the modified version needs only 3 epochs.

In the more complex HandReach-v0 environment, the mean_sqr_action version needed 20 epochs to reach a success rate of 0.4, whereas the modified version already reached 0.5 after 11 epochs and 0.55 after 20 epochs.
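
A minimal NumPy sketch of how the two loss terms above combine; the batch values, max_u, and action_l2 here are hypothetical, and in the real code Q_pi_tf and pi_tf are TensorFlow tensors produced by the critic and actor:

```python
import numpy as np

# Hypothetical batch: critic values Q(s, pi(s)) and actor outputs pi(s)
Q_pi = np.array([0.8, 1.2, 0.5])
pi = np.array([[0.3, -0.9], [0.1, 0.4], [-0.7, 0.2]])

action_l2 = 1.0   # weight of the action penalty (illustrative value)
max_u = 1.0       # maximum absolute action magnitude (illustrative value)

policy_term = -np.mean(Q_pi)                                # maximize Q by minimizing -Q
penalty_term = action_l2 * np.mean(np.square(pi / max_u))   # the "mean_sqr_action" term

pi_loss = policy_term + penalty_term
print(policy_term, penalty_term, pi_loss)  # setting action_l2 = 0 recovers the modified version
```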

kirk86 commented 6 years ago

I also don't understand why there are two different implementations of DDPG. Not only are they different, but they also use different experience-replay mechanisms: baselines.ddpg uses a ring buffer while baselines.her.ddpg uses a replay buffer. Also, in baselines.ddpg, I don't understand why they create copies of the actor and critic for the target actor and target critic instead of simply instantiating new objects as required.

astier commented 6 years ago

> I don't understand why they create copies of the actor and critic for the target actor and target critic instead of simply instantiating new objects as required.

Why do you think it's required? In the DDPG paper, the algorithm is defined such that the target networks are just copies that slowly catch up with the non-target networks via soft target updates.
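
A minimal sketch of that soft-target-update rule (Polyak averaging), with small NumPy arrays standing in for network weights; the names and the value of tau are illustrative:

```python
import numpy as np

tau = 0.001  # soft-update rate; the target moves only slightly toward the main network each step

# Hypothetical "network" parameters
main_params = [np.random.randn(4, 4), np.random.randn(4)]
target_params = [p.copy() for p in main_params]  # targets start as exact copies

def soft_update(target_params, main_params, tau):
    """target <- tau * main + (1 - tau) * target, applied parameter by parameter (in place)."""
    for t, m in zip(target_params, main_params):
        t *= (1.0 - tau)
        t += tau * m

soft_update(target_params, main_params, tau)
```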

astier commented 6 years ago

Oh ok. I assume the following term is just a regularization term for the actions, hence the variable name self.action_l2:

self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))

Could somebody confirm or disprove this?

kirk86 commented 6 years ago

> In the DDPG paper, the algorithm is defined such that the target networks are just copies that slowly catch up with the non-target networks via soft target updates.

Cool, thanks, that kind of makes sense. BTW, are you referring to https://arxiv.org/abs/1509.02971 as the DDPG paper?

astier commented 6 years ago

Yes.

JBLanier commented 6 years ago

Referring to the actor-loss component at https://github.com/openai/baselines/blob/f2729693253c0ef4d4086231d36e0a4307ec1cb3/baselines/her/ddpg.py#L274: what would be a reason to penalize the magnitude of the actions, as is done here?

filipolszewski commented 6 years ago

In the HER paper it is motivated in this way:

> In order to prevent tanh saturation and vanishing gradients we add the square of their preactivations to the actor's cost function

Edit: I haven't learned TensorFlow, so I am not sure, but doesn't this line of code take the mean of the square of the actions rather than of the tanh preactivations? I think it uses the output of the actor network, which seems inconsistent with the paper.
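
A small sketch of that point, assuming the actor output is pi = max_u * tanh(preactivation) (which is how baselines/her/actor_critic.py appears to define pi_tf): dividing by max_u undoes only the scaling, so the loss squares tanh(preactivation) rather than the preactivation itself.

```python
import numpy as np

max_u = 1.0
preactivation = np.array([0.2, 2.5, -4.0])  # hypothetical raw actor outputs before tanh
pi = max_u * np.tanh(preactivation)         # the action the actor actually outputs

penalized_by_code = np.square(pi / max_u)   # what the code squares: tanh(preactivation), bounded in [0, 1)
paper_wording = np.square(preactivation)    # squaring the preactivation itself grows without bound

print(penalized_by_code)  # roughly [0.04, 0.97, 1.00]
print(paper_wording)      # [0.04, 6.25, 16.0]
```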

filipolszewski commented 6 years ago

I think it is not elegant to make such a change to the actor loss function and mention it only in the 'technical details' of the paper, while using the term vanilla DDPG across the whole publication.

Oh, and if the actual vanilla DDPG were used in the paper, it would be outrageous to make such a change only for the HER version; no need for explanation here...

Also, I am working on my own implementation of DDPG+HER for solving the Fetch environments, and I have noticed that the actor performs really unstable actions during learning - usually moving the arm away from the table, putting it up in the air, etc. After some initial experimentation, it seems like the regularization of the preactivations helps keep it focused on the table a bit more, which might be a big improvement. As I am still working on the code, I will edit this comment once I can back this up with data.