Closed: BigBadBurrow closed this issue 4 years ago
Refer to #6
Yeah, I realised that, but an agent will still learn with epoch = 1. Sooo... I guess in that case it'd just be using the critic / advantage aspect rather than anything from policy optimization? Nice proof of concept though that it can learn using only the critic part.
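A quick toy sketch of why that works (illustrative tensors only, none of this is the repo's code): with a single epoch every ratio is exactly 1, so the clip is a no-op and the surviving gradient is just the advantage-weighted policy gradient, i.e. an A2C-style actor update plus the critic loss.

```python
import torch

# Illustrative values only; in the repo these come from memory / evaluate().
old_logprobs = torch.tensor([-1.2, -0.7, -2.0])        # stored at act() time
logprobs = old_logprobs.clone().requires_grad_(True)   # epoch 1: same policy
advantages = torch.tensor([0.5, -0.3, 1.1])

ratios = torch.exp(logprobs - old_logprobs.detach())   # all exactly 1.0
loss = -(ratios * advantages).mean()                   # clipping is a no-op at ratio 1
loss.backward()

# d(ratio)/d(logprob) = ratio = 1, so the gradient collapses to the
# vanilla policy-gradient term -advantage * d(logprob)/d(theta):
print(logprobs.grad)   # tensor([-0.1667,  0.1000, -0.3667]) == -advantages / 3
```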
I've been looking over the code to get a better grasp of what it's doing, and the one thing that confuses me is in the `update()` method: why aren't the ratios always 1? The log probabilities stored in memory were obtained from `policy_old`, and then in `update()` it gets the log probabilities from `policy` via the `evaluate()` method, and the `exp` of the difference between them is the ratio. Afterwards `policy_old`'s weights are updated from `policy`, so they're the same. But if the same state is fed into exact copies of `policy`, then I don't understand why they'd produce different log probabilities. I'm obviously missing a piece of the puzzle, but I can't think what it is.
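To make the puzzle concrete, here's a minimal self-contained toy of my mental model of the loop (a bare linear "policy", random advantages; variable names are mine, not the repo's). Running it shows the ratios are all exactly 1 on the first epoch and only drift on later epochs, because `optimizer.step()` moves `policy` while the `old_logprobs` from `policy_old` stay frozen:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the actor: logits for 3 actions from a 4-dim state.
policy = nn.Linear(4, 3)
policy_old = nn.Linear(4, 3)
policy_old.load_state_dict(policy.state_dict())    # exact copies, as in the repo

states = torch.randn(8, 4)
dist_old = torch.distributions.Categorical(logits=policy_old(states))
actions = dist_old.sample()
old_logprobs = dist_old.log_prob(actions)          # what act() stores in memory
advantages = torch.randn(8)                        # placeholder advantages

optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)

for epoch in range(4):                             # K_epochs = 4
    dist = torch.distributions.Categorical(logits=policy(states))
    logprobs = dist.log_prob(actions)              # the evaluate() step
    ratios = torch.exp(logprobs - old_logprobs.detach())
    print(epoch, [f"{r:.3f}" for r in ratios.tolist()])  # epoch 0: all 1.000

    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 0.8, 1.2) * advantages
    loss = -torch.min(surr1, surr2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # policy moves; policy_old doesn't

# Only after all K epochs is the snapshot resynced:
policy_old.load_state_dict(policy.state_dict())
```

As far as I can tell the repo does the same thing: `policy_old.load_state_dict(policy.state_dict())` only runs after all `K_epochs` passes, so every epoch after the first is comparing against a stale snapshot, which is where the non-unit ratios come from.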