muupan / async-rl

Replicating "Asynchronous Methods for Deep Reinforcement Learning" (http://arxiv.org/abs/1602.01783)
MIT License

Potential errors in the loss funktion #12

Closed. stokasto closed this issue 8 years ago.

stokasto commented 8 years ago

Hey, first off: great work. I just re-implemented the paper myself using TensorFlow, and your code provided great "side-information" for doing so :). In the process I also realized that there may be two subtle bugs in your implementation (although I have never used chainer before, so I might be misunderstanding things):

  1. You use the log_probabilities directly when computing the actor loss (a3c.py#L103):
    pi_loss -= log_prob * float(advantage.data)
    I believe this is incorrect: you should multiply log_prob by a one-hot encoding of the action, since only one of the actions was selected.
  2. To compute the advantage (a3c.py#L97) you use:
    advantage = R - v
    where v comes from self.past_values[i], which in turn is the output of the value network. As I wrote, I am no expert regarding chainer, but you need to make sure that no gradient flows through v here (the value function should only be updated via the v_loss in your code). In theano/tensorflow this would be handled with a disconnected_grad() or stop_gradient() operation, respectively; see the sketch below.
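
For concreteness, here is a minimal sketch of what I mean, written in the TensorFlow style I am using for my re-implementation (the function and variable names are just for this example, not taken from your code):

    import tensorflow as tf

    def actor_loss(logits, actions, returns, values):
        # logits: (B, A) policy logits; actions: (B,) int32 action indices
        # returns: (B,) bootstrapped returns R; values: (B,) critic outputs V(s)
        log_probs = tf.nn.log_softmax(logits)                            # (B, A)
        one_hot = tf.one_hot(actions, depth=tf.shape(logits)[1])         # (B, A)
        selected_log_probs = tf.reduce_sum(log_probs * one_hot, axis=1)  # (B,)
        # Treat the advantage as a constant for the actor update; the critic
        # should be trained only through its own value loss.
        advantage = tf.stop_gradient(returns - values)
        return -tf.reduce_sum(selected_log_probs * advantage)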

I will push my implementation to GitHub sometime this week, as soon as I have tested it more thoroughly, and will then reference it here for you to compare.

stokasto commented 8 years ago

And I win for misspelling function in the issue title ;).

etienne87 commented 8 years ago

@stokasto, probably not for point 1:

Look closely at policy_output.py#L54:

    def sampled_actions_log_probs(self):
        return F.select_item(
            self.log_probs,
            chainer.Variable(np.asarray(self.action_indices, dtype=np.int32)))

I assume this function is equivalent to multiplying by a one-hot encoded target.
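
A quick standalone check (assuming a reasonably recent chainer; the toy numbers here are arbitrary and just for illustration):

    import numpy as np
    import chainer
    import chainer.functions as F

    log_probs = chainer.Variable(
        np.log(np.array([[0.2, 0.8], [0.5, 0.5]], dtype=np.float32)))
    actions = np.array([1, 0], dtype=np.int32)

    # What the repository does: pick the log probability of the taken action.
    selected = F.select_item(log_probs, chainer.Variable(actions))

    # The equivalent one-hot formulation.
    one_hot = chainer.Variable(np.eye(2, dtype=np.float32)[actions])
    via_one_hot = F.sum(log_probs * one_hot, axis=1)

    print(selected.data)     # [log(0.8), log(0.5)]
    print(via_one_hot.data)  # same values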

stokasto commented 8 years ago

@etienne87 Ah yes, that is fine then. As I wrote, I did not study the code very carefully during my re-implementation, but I thought I would bring these two issues up since, if real, they would be hard to detect and could lurk in the code base for a while. Thank you for falsifying the first issue.

muupan commented 8 years ago

Thank you for your comments.

  1. As @etienne87 wrote, F.select_item returns only the log probability of the selected action (whose index is stored in self.action_indices of SoftmaxPolicyOutput).
  2. float(advantage.data) is what makes sure the gradients won't flow through v. advantage is a chainer.Variable, which can back-propagate gradients, but float(advantage.data) is literally just a Python float, so no gradients will be back-propagated through it; see the example below.
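
A tiny standalone example of this behaviour (the values are arbitrary):

    import numpy as np
    import chainer

    log_prob = chainer.Variable(np.array([-0.5], dtype=np.float32))
    v = chainer.Variable(np.array([1.0], dtype=np.float32))
    R = 2.0

    advantage = R - v                            # a Variable, still connected to v
    pi_loss = -log_prob * float(advantage.data)  # plain float: the link to v is cut

    pi_loss.grad = np.ones_like(pi_loss.data)
    pi_loss.backward()
    print(log_prob.grad)  # [-1.]  gradient reaches log_prob
    print(v.grad)         # None   no gradient reaches v through pi_loss
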
stokasto commented 8 years ago

Hey, great, that explains it. I was wondering how the implementation could have converged to a reasonable value function if that error were indeed there. As I said, I am not experienced with chainer and so did not catch this subtlety. My results also look fairly similar to the graphs you have posted. I'll upload my code later and link to your implementation. Feel free to close this issue.