tensorlayer / TensorLayer

Deep Learning and Reinforcement Learning Library for Scientists and Engineers

http://tensorlayerx.com

7.31k stars, 1.61k forks

Bug of PPO #1072

Open GIS-PuppetMaster opened 4 years ago

GIS-PuppetMaster commented 4 years ago

    ratio = tf.exp(pi.log_prob(action) - old_pi.log_prob(action))
    surr = ratio * adv
    ...
    loss = -tf.reduce_mean(
        tf.minimum(surr, tf.clip_by_value(ratio, 1. - self.epsilon, 1. + self.epsilon) * adv)
    )

tf.minimum should take ratio rather than surr. Because surr = ratio * adv and adv can contain negative values, the result of tf.minimum may contain a value like -1e10, which makes the actor's loss fail.
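To make the case above concrete, here is a minimal NumPy sketch of the per-sample term inside tf.minimum when every advantage is negative. The function name `clipped_surrogate`, the value epsilon = 0.2, and the sample ratios are assumptions chosen for illustration; none of them come from the repo.

```python
import numpy as np

def clipped_surrogate(ratio, adv, epsilon=0.2):
    """Per-sample term: min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv)."""
    ratio = np.asarray(ratio, dtype=np.float64)
    adv = np.asarray(adv, dtype=np.float64)
    surr = ratio * adv
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * adv
    return np.minimum(surr, clipped)

# Hypothetical batch: negative advantages, ratios drifting away from 1.
ratio = np.array([0.5, 1.0, 3.0, 100.0])
adv = np.full_like(ratio, -1.0)

per_sample = clipped_surrogate(ratio, adv)
loss = -per_sample.mean()

print(per_sample)  # the unclipped term wins whenever ratio > 1 + epsilon
print(loss)
```

With a negative advantage, the minimum selects the unclipped product once the ratio exceeds 1 + epsilon, so a single sample with a very large ratio can dominate the mean. Whether that is what happened in the run described here cannot be told from the thread alone.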

GIS-PuppetMaster commented 4 years ago

It should be like this:

    self.cliped_ratio = tf.clip_by_value(self.ratio, 1. - METHOD['epsilon'], 1. + METHOD['epsilon'])
    self.min_temp = tf.minimum(self.ratio, self.cliped_ratio)
    self.aloss = -tf.reduce_mean(self.min_temp * self.tfadv)
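For comparison, the sketch below evaluates both formulations on the same small batch using NumPy instead of TensorFlow. The function names, epsilon = 0.2, and the sample values are assumptions for illustration only. It suggests that the two forms agree when advantages are positive and give different losses when they are negative, so the proposed change is not a pure refactoring.

```python
import numpy as np

def loss_min_on_products(ratio, adv, epsilon=0.2):
    # Form in the original snippet: min(ratio * adv, clip(ratio) * adv).
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return -np.mean(np.minimum(ratio * adv, clipped * adv))

def loss_min_on_ratio(ratio, adv, epsilon=0.2):
    # Form proposed in this comment: min(ratio, clip(ratio)) * adv.
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return -np.mean(np.minimum(ratio, clipped) * adv)

ratio = np.array([0.5, 2.0])

# Positive advantages: both forms pick the same term.
pos_adv = np.array([1.0, 1.0])
pos_a = loss_min_on_products(ratio, pos_adv)
pos_b = loss_min_on_ratio(ratio, pos_adv)

# Negative advantages: multiplying by adv flips the ordering, so they differ.
neg_adv = np.array([-1.0, -1.0])
neg_a = loss_min_on_products(ratio, neg_adv)
neg_b = loss_min_on_ratio(ratio, neg_adv)

print(pos_a, pos_b)
print(neg_a, neg_b)
```

For a negative advantage, taking the minimum over the ratio first is equivalent to taking the maximum over the products, which is why the second pair of numbers diverges.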

quantumiracle commented 4 years ago

Why would a negative value cause the actor loss to fail? You can also refer to OpenAI baselines here, which follows a similar process to our repo.

GIS-PuppetMaster commented 4 years ago

> Why the negative value causes failure in actor loss? You can also refer to OpenAI baselines here, which has similar process as our repo.

I plotted the loss and the reward. When a very small negative value such as 1e-10 appears, the loss becomes far larger than normal and the reward stops increasing. I just tried a lower learning rate, and no such 1e-10 value appeared. I wonder whether using my code above would have the same effect, since it is more robust.

quantumiracle commented 3 years ago

Sorry for the late reply. If I understood correctly, what you mentioned might be caused by a numerical issue in tf.minimum. Could you please print out an example case and paste it here? I'm a bit confused by your description, since you mentioned both a large negative value (-1e10) and a small positive value (1e-10). A case showing how it leads to a large loss value would be great.
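One way to produce such an example case is to dump the sample that contributes the most extreme term to the loss. The helper below is a hypothetical sketch in NumPy, not part of TensorLayer; the name `describe_worst_sample`, the default epsilon, and the input values are all assumptions. In a real run the three arrays would come from the batch that produced the loss spike.

```python
import numpy as np

def describe_worst_sample(log_prob, old_log_prob, adv, epsilon=0.2):
    """Report the sample with the smallest clipped-surrogate term."""
    log_prob = np.asarray(log_prob, dtype=np.float64)
    old_log_prob = np.asarray(old_log_prob, dtype=np.float64)
    adv = np.asarray(adv, dtype=np.float64)

    ratio = np.exp(log_prob - old_log_prob)
    surr = ratio * adv
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * adv
    per_sample = np.minimum(surr, clipped)

    i = int(np.argmin(per_sample))
    return {
        "index": i,
        "ratio": float(ratio[i]),
        "adv": float(adv[i]),
        "surr": float(surr[i]),
        "clipped": float(clipped[i]),
        "loss": float(-per_sample.mean()),
    }

# Made-up batch in which the third sample's log-probability jumped by 5.
report = describe_worst_sample(
    log_prob=[-1.0, -0.5, 2.0],
    old_log_prob=[-1.0, -0.6, -3.0],
    adv=[0.3, -0.2, -1.5],
)
print(report)
```

Pasting a report like this together with the learning rate would show whether the large loss comes from a large ratio, a large advantage, or something else.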