Open GIS-PuppetMaster opened 4 years ago
it should be like this: self.cliped_ratio = tf.clip_by_value(self.ratio, 1. - METHOD['epsilon'],
Why the negative value causes failure in actor loss? You can also refer to OpenAI baselines here, which has similar process as our repo.
Why the negative value causes failure in actor loss? You can also refer to OpenAI baselines here, which has similar process as our repo.
I drawed the loss polt and reward plot, when there is a very small negative value, such as 1e-10, the loss will be extremly larger than normally, and the reward stoped increase. I just tried lower learning rate, and there was no such 1e-10 value came out. I wonder if it's the same that use my code above, since it's more robust.
Sorry for the late reply. What you mentioned might be caused by some numerical issues in tf.minimum if I understood correctly. Could you please print out an example case and paste it here? I'm a bit confused by your description since you mentioned both large negative value (-1e10) and small positive value (1e-10). A case showing how it causes a large loss value would be great.
ratio = tf.exp(pi.log_prob(action) - old_pi.log_prob(action)) surr = ratio adv ... loss = -tf.reduce_mean( tf.minimum(surr, tf.clip_by_value(ratio, 1. - self.epsilon, 1. + self.epsilon) adv) )
should use ratio in tf.minimum rather than surr, because surr=ration*adv, and there could be negative value in adv, so the result of tf.minimum may contain a value like -1e10, and cause actor's loss failed.