abhigenie92 opened this issue 7 years ago
```python
opt = Adam(lr=self.learning_rate)
model.compile(loss='categorical_crossentropy', optimizer=opt)
```
First, I think the loss should be the policy gradient, gradient = (y - prob) * reward, rather than plain categorical cross-entropy. Second, we already set the learning_rate on opt, so multiplying the gradients by it again when building Y applies the learning rate twice.
So should Y be self.probs + np.vstack([gradients]) (without the extra learning-rate factor)? Then Y - Y_predict = Y - self.probs = np.vstack([gradients]).
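For concreteness, here is a minimal numpy sketch of that reading of the target construction; the names and shapes (probs, actions, returns) are my own assumptions for illustration, not quotes from pg.py:

```python
import numpy as np

# Hypothetical rollout data: `probs` are the recorded policy outputs,
# `actions` the sampled actions, `returns` the discounted rewards.
n_steps, n_actions = 4, 3
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(n_actions), size=n_steps)   # pi(a|s) per step
actions = rng.integers(0, n_actions, size=n_steps)         # sampled actions
returns = rng.normal(size=(n_steps, 1))                    # discounted rewards

one_hot = np.eye(n_actions)[actions]        # y: one-hot of the taken action
gradients = (one_hot - probs) * returns     # (y - prob) * reward

# The targets shift the predicted probabilities along the reward-weighted
# gradient, so Y - self.probs recovers exactly np.vstack([gradients]).
Y = probs + gradients
assert np.allclose(Y - probs, gradients)
```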
I do understand backpropagation in policy-gradient networks, but I am not sure how your code works with Keras's auto-differentiation.
That is, I am not sure how you transform it into a supervised learning problem. For example, in the code linked below:
Why is Y not the one-hot vector for the action taken? You compute the gradient assuming the taken action is correct (y is the one-hot vector), then multiply it by the reward of the corresponding time-step, but while training you feed it back as a correction to the predicted probabilities. I think one could instead multiply the rewards by the one-hot vector and feed that straight away as the target (see the sketch below).
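A minimal sketch of that alternative, as I imagine it (this is my assumption, not code from the repository): scale the one-hot action by the return and use it directly as the categorical-cross-entropy target.

```python
import numpy as np

def reinforce_targets(actions, returns, n_actions):
    """Y[t] = one_hot(actions[t]) * returns[t], fed directly as the target."""
    one_hot = np.eye(n_actions)[actions]
    return one_hot * returns.reshape(-1, 1)

# With categorical cross-entropy, loss_t = -sum_a Y[t, a] * log(prob[t, a])
#                                        = -returns[t] * log(prob[t, actions[t]]),
# and its gradient w.r.t. the logits is returns[t] * (prob[t] - one_hot[t]),
# i.e. the same reward-weighted (y - prob) gradient discussed above.
actions = np.array([0, 2, 1])
returns = np.array([1.0, -0.5, 2.0])
print(reinforce_targets(actions, returns, n_actions=3))
```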
If possible, please clarify my doubt. :) https://github.com/keon/policy-gradient/blob/master/pg.py#L67