Closed: BigBadBurrow closed this issue 4 years ago.

Hello, thank you for such a clear example of PPO in PyTorch. I wonder if you might know how the `update()` method might be modified to minimize rather than maximize? In my case I want to minimize a regret factor rather than maximize a reward. Many thanks.

Another question: why use `Tanh()` activation instead of `ReLU()`?
Hey, I would suggest you store the regrets as negative rewards, i.e. while appending the rewards, append `-regret`. Or, in the `update()` function, do `rewards = -rewards`.
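A minimal sketch of the negation idea; the `regrets_to_rewards` helper is purely illustrative and not part of the repo:

```python
import torch

def regrets_to_rewards(regrets):
    # Negate per-step regrets so that PPO, which maximizes expected
    # reward, ends up minimizing expected regret instead.
    return torch.tensor(regrets, dtype=torch.float32).neg()

# e.g. regrets collected over three steps
rewards = regrets_to_rewards([0.5, 0.2, 0.0])
print(rewards)  # negated regrets, ready to be discounted like ordinary rewards
```

Either variant works because PPO only ever sees the scalar reward signal, so flipping its sign flips the optimization direction.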
The choice of activation function depends on the environment; I have found that `Tanh` performs slightly better than `ReLU`, and most other PPO implementations also use `Tanh`.
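For reference, a minimal actor network with `Tanh` hidden activations; the layer sizes here are illustrative, not the repo's actual architecture:

```python
import torch.nn as nn

# Small policy MLP with Tanh hidden activations, as discussed above;
# swapping nn.Tanh() for nn.ReLU() is the alternative being compared.
state_dim, action_dim = 8, 4  # illustrative dimensions
actor = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, action_dim),
    nn.Softmax(dim=-1),  # discrete action probabilities
)
```

One common rationale is that `Tanh` is zero-centered and bounded, which keeps hidden activations in a small range and tends to make on-policy updates a bit more stable.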
Sorry, yes of course. I think I need more sleep ha ha