Change sign of entropy penalty for punishment

Entropy should be added to the overall loss with a + sign, since we want it to act as a penalty that makes the probability distribution over actions steeper (more peaked).

entropy = - tf.reduce_sum(prob_tf * log_prob_tf)

The log probabilities inside the parentheses are negative (or zero), so we put one more minus sign in front of the sum to get a positive value overall.
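
As a quick check of the sign, here is a minimal pure-Python sketch (not the TensorFlow code from the repository). The helper `entropy` and the two 4-action distributions are hypothetical, chosen only for illustration. It computes the same quantity as the line above and shows that it is non-negative and smaller for a steeper distribution.

```python
import math

def entropy(probs):
    """Entropy in nats of a discrete distribution; terms with p == 0 contribute 0."""
    # Each p * log(p) is <= 0 for 0 < p <= 1, so the leading minus makes the sum >= 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical action distributions (assumed for illustration only).
uniform = [0.25, 0.25, 0.25, 0.25]  # flat: every action equally likely
peaked = [0.97, 0.01, 0.01, 0.01]   # steep: one action dominates

h_uniform = entropy(uniform)  # equals log(4), the maximum for 4 actions
h_peaked = entropy(peaked)    # much smaller, but still non-negative

print(h_uniform, h_peaked)
```

The flat distribution gives the largest entropy and the steep one a value close to zero, so the entropy term shrinks as the policy becomes more confident.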

So, the loss is currently computed as

self.loss = pi_loss + 0.5 * vf_loss - entropy * 0.01

but when we sum up all of the penalties, the entropy should also be added as a penalty rather than subtracted.

To my mind, the correct formula should be as follows:

self.loss = pi_loss + 0.5 * vf_loss + entropy * 0.01
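
To make the effect of the sign concrete, here is a small sketch in plain Python. The function `total_loss`, its `entropy_sign` parameter, and all numeric values are hypothetical and assumed only for illustration; they are not taken from the repository. It compares both sign choices for a flat and a steep policy while holding `pi_loss` and `vf_loss` fixed.

```python
def total_loss(pi_loss, vf_loss, entropy, entropy_sign, beta=0.01):
    """Combine the loss terms; entropy_sign is +1 (proposed) or -1 (current)."""
    return pi_loss + 0.5 * vf_loss + entropy_sign * beta * entropy

# Hypothetical values, assumed for illustration only.
pi_loss, vf_loss = 1.0, 2.0
h_flat = 1.386   # roughly log(4): entropy of a uniform 4-action policy
h_steep = 0.168  # entropy of a policy that strongly prefers one action

# Proposed sign (+): entropy is added to the loss.
plus_flat = total_loss(pi_loss, vf_loss, h_flat, +1)
plus_steep = total_loss(pi_loss, vf_loss, h_steep, +1)

# Current sign (-): entropy is subtracted from the loss.
minus_flat = total_loss(pi_loss, vf_loss, h_flat, -1)
minus_steep = total_loss(pi_loss, vf_loss, h_steep, -1)

print(plus_flat, plus_steep, minus_flat, minus_steep)
```

With the + sign the steep policy gets the lower loss, so minimizing the loss pushes toward a more peaked distribution; with the - sign the flat policy gets the lower loss instead.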

openai / universe-starter-agent #110