miyosuda / async_deep_reinforce

Asynchronous Methods for Deep Reinforcement Learning
Apache License 2.0

why recalculate pi and v? #41

joyousrabbit opened this issue 7 years ago

joyousrabbit commented 7 years ago

Hello, in game_ac_network.py, in def prepare_loss(self, entropy_beta), you have:

  # temporary difference (R-V) (input for policy)
  self.td = tf.placeholder("float", [None])

  value_loss = 0.5 * tf.nn.l2_loss(self.r - self.v)

But td == self.r - self.v, right?

So why not use self.td directly instead of recomputing self.r - self.v? Similarly for pi, why not feed it in as a placeholder as well?

Hoping for a reply, thanks.

MogicianWu commented 7 years ago

Because self.td is a batch of numbers fed in from outside the graph, used only in the policy-gradient term, so no gradient flows through it. self.r - self.v, on the other hand, is built from the graph's own value output self.v, which is exactly what lets the critic loss backpropagate into the value network. If you fed that difference in as a placeholder too, the value network would never receive a gradient.
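
For illustration, here is a minimal TF1-style sketch of that distinction. It is not the repo's exact code: it uses a tiny dense stand-in network instead of the conv/LSTM model in game_ac_network.py, omits the entropy bonus, and ACTION_SIZE / STATE_SIZE are made-up sizes chosen only for the example.

  import tensorflow as tf  # TF1-style API, matching the repo's use of tf.placeholder

  ACTION_SIZE = 4  # made-up size, for illustration only
  STATE_SIZE = 8   # made-up size, for illustration only

  # Tiny dense stand-in for the real conv/LSTM network in game_ac_network.py.
  s = tf.placeholder("float", [None, STATE_SIZE])
  W_pi = tf.get_variable("W_pi", [STATE_SIZE, ACTION_SIZE])
  W_v = tf.get_variable("W_v", [STATE_SIZE, 1])
  pi = tf.nn.softmax(tf.matmul(s, W_pi))    # policy head
  v = tf.reshape(tf.matmul(s, W_v), [-1])   # value head

  a = tf.placeholder("float", [None, ACTION_SIZE])  # taken actions (one-hot)
  td = tf.placeholder("float", [None])  # advantage R - V, computed outside the graph
  r = tf.placeholder("float", [None])   # discounted n-step return R

  log_pi = tf.log(tf.clip_by_value(pi, 1e-20, 1.0))

  # Policy (actor) loss: td is a fed-in constant as far as the graph is concerned,
  # so minimizing this term pushes gradients into the policy head only.
  # (The entropy bonus from the repo is omitted here for brevity.)
  policy_loss = -tf.reduce_sum(tf.reduce_sum(log_pi * a, axis=1) * td)

  # Value (critic) loss: r - v is built from the graph's own v, so gradients
  # DO flow into the value head and train the critic.
  value_loss = 0.5 * tf.nn.l2_loss(r - v)

  total_loss = policy_loss + value_loss

An equivalent way to keep the advantage out of the value-head gradient without a second placeholder would be tf.stop_gradient(r - v) inside the graph; feeding td externally does the same job and also lets each A3C worker compute the n-step return and advantage outside TensorFlow.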