Closed stokasto closed 8 years ago
And I win for misspelling function in the issue title ;).
@stokasto, probably not for 1. Look closely at:
```python
def sampled_actions_log_probs(self):
    return F.select_item(
        self.log_probs,
        chainer.Variable(np.asarray(self.action_indices, dtype=np.int32)))
```
I assume this function is equivalent to multiplying by a one-hot encoded target.
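A quick way to convince yourself of that equivalence is a toy NumPy check (shapes and numbers here are made up for illustration): gathering one entry per row by action index gives the same result as summing against a one-hot mask.

```python
import numpy as np

# Toy batch: 3 states, 4 discrete actions (values are arbitrary).
log_probs = np.log(np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
]))
actions = np.array([2, 0, 3])

# What F.select_item does conceptually: gather one entry per row.
gathered = log_probs[np.arange(len(actions)), actions]

# The one-hot formulation suggested in the issue report.
one_hot = np.eye(log_probs.shape[1])[actions]
via_one_hot = (log_probs * one_hot).sum(axis=1)

assert np.allclose(gathered, via_one_hot)
```

Both paths pick out exactly one log probability per row, so multiplying by a one-hot mask is redundant when a gather is available.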
@etienne87 Ah yes, that is fine then. As I wrote, I did not study the code very carefully during my re-implementation, but thought I would bring these two issues up since (in case they are true) they would be hard to detect and could lurk in the code base for a while. Thank you for falsifying the first issue.
Thank you for your comments.

F.select_item returns only the log probability corresponding to the selected action (which is stored in self.action_indices of SoftmaxPolicyOutput).

float(advantage.data) is what makes sure the gradients won't flow through v. advantage is a chainer.Variable, which can back-propagate gradients, but float(advantage.data) is literally just a Python float, and no gradients will be back-propagated through it.

Hey, great, that explains it. I was wondering how the implementation could have converged to a reasonable value function if that error were indeed there. As I said, I am not experienced with Chainer and so did not catch this little interaction. My results also look fairly similar to the graphs you have posted. I'll upload them later and link to your implementation. Feel free to close this issue.
Hey, first off: great work. I just re-implemented the paper myself using TensorFlow, and your code provided great "side-information" for doing so :). In the process I also realized that there may be two subtle bugs in your implementation (although I have never used Chainer before, so I might be misunderstanding things):

1. pi_loss -= log_prob * float(advantage.data) — I believe this is incorrect, as you should multiply log_prob with a one-hot encoding of the action (since only one of the actions was selected).

2. advantage = R - v, where v comes from self.past_values[i], which in turn is the output of the value network. As I wrote, I am no expert regarding Chainer, but you need to make sure that no gradient flows through v here (as the value function should only be updated according to the v_loss in your code). In Theano/TensorFlow this would be handled with a disconnected_grad() or stop_gradient() operation, respectively.

I will push my implementation to GitHub sometime this week, as soon as I have tested it more thoroughly, and can then reference it here for you to compare.
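Putting both points together, here is a toy NumPy sketch (the function name and shapes are hypothetical, not taken from the repository) of how the two losses separate when the advantage is treated as a constant in the policy term:

```python
import numpy as np

def a3c_losses(log_probs, values, actions, returns):
    """Toy sketch: gather the taken action's log-prob directly (no explicit
    one-hot needed) and keep the advantage out of the policy gradient."""
    taken = log_probs[np.arange(len(actions)), actions]
    advantage = returns - values             # would be stop-gradient'ed in TF
    pi_loss = -(taken * advantage).sum()     # advantage used as a constant here
    v_loss = 0.5 * (advantage ** 2).sum()    # value net updated only via this term
    return pi_loss, v_loss

pi_loss, v_loss = a3c_losses(
    log_probs=np.log(np.full((2, 3), 1 / 3)),  # uniform policy over 3 actions
    values=np.array([0.5, 0.2]),
    actions=np.array([0, 2]),
    returns=np.array([1.0, 0.0]),
)
```

The key design point is that advantage appears in both losses, but only v_loss is allowed to send gradients back into the value network.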