keon / policy-gradient

Minimal Monte Carlo Policy Gradient (REINFORCE) Algorithm Implementation in Keras

Loss function/Labels for neural network used? #4

Open · abhigenie92 opened this issue 7 years ago

abhigenie92 commented 7 years ago

I do understand backpropagation in policy gradient networks, but I am not sure how your code works with Keras's auto-differentiation.

That is, I don't see how you transform it into a supervised learning problem. For example, in the line below:

Y = self.probs + self.learning_rate * np.squeeze(np.vstack([gradients]))

Why is Y not a one-hot vector for the action taken? You compute the gradient as if the sampled action were correct, i.e. as if Y were the one-hot vector, and then multiply it by the reward at the corresponding time step. But during training you feed that sum in as the correction. I think one could instead multiply the rewards by the one-hot vector and feed that in directly.

If possible, please clarify my doubt. :) https://github.com/keon/policy-gradient/blob/master/pg.py#L67
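
A minimal sketch of the alternative being suggested here, assuming a TF2 Keras policy network compiled with categorical_crossentropy (the network sizes and rollout data below are placeholders, not the repository's code): with labels set to the discounted return times the one-hot action, the crossentropy gradient with respect to the logits is sum(Y) * probs - Y, which for these labels reduces to return * (probs - one_hot), so gradient descent moves the logits in the REINFORCE direction.

import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

# Hypothetical policy net: 4 state features -> 2 discrete actions.
model = Sequential([
    Dense(24, activation='relu', input_shape=(4,)),
    Dense(2, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.01))

# Placeholder rollout: states, sampled actions, discounted (normalized) returns.
states = np.random.randn(10, 4).astype(np.float32)
actions = np.random.randint(0, 2, size=10)
returns = np.random.randn(10).astype(np.float32)

# Reward-weighted one-hot labels: for categorical crossentropy the gradient
# w.r.t. the logits is sum(Y)*probs - Y, so Y = G_t * one_hot yields the
# REINFORCE direction G_t * (one_hot - probs) under gradient descent.
Y = np.zeros((10, 2), dtype=np.float32)
Y[np.arange(10), actions] = returns

model.train_on_batch(states, Y)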

LinkToPast1990 commented 5 years ago

opt = Adam(lr=self.learning_rate)
model.compile(loss='categorical_crossentropy', optimizer=opt)

First, I think the gradient of this loss works out to gradient = (y - prob) * reward. Second, we have already set the learning_rate on opt, so it should not be baked into the labels again.

So shouldn't Y be self.probs + np.vstack([gradients])? Then Y - Y_predict = Y - self.probs = np.vstack([gradients]).
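
That algebra can be checked numerically; here is a small sketch (placeholder numbers, TF2 API). Since gradients = (one_hot - probs) * reward sums to zero across actions, the fake label Y = probs + gradients still sums to one per row, and the crossentropy gradient with respect to the logits collapses to probs - Y = -gradients, so a descent step moves the logits along the stored gradients.

import numpy as np
import tensorflow as tf

probs = np.array([[0.3, 0.7]], dtype=np.float32)    # policy output pi(a|s)
one_hot = np.array([[1.0, 0.0]], dtype=np.float32)  # sampled action a = 0
reward = 2.0                                        # discounted return G_t
grads = (one_hot - probs) * reward                  # stored REINFORCE gradient
Y = probs + grads                                   # fake label; row still sums to 1

logits = tf.Variable(np.log(probs))                 # logits that reproduce probs
with tf.GradientTape() as tape:
    p = tf.nn.softmax(logits)
    loss = tf.keras.losses.categorical_crossentropy(Y, p)
dz = tape.gradient(loss, logits).numpy()

print(dz)  # equals probs - Y == -grads: descent pushes logits along grads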

LinkToPast1990 commented 5 years ago

https://github.com/gabrielgarza/openai-gym-policy-gradient/blob/master/policy_gradient_layers.py
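
For context, the linked implementation avoids the fake-label trick entirely and defines the objective directly as return-weighted negative log-probability of the actions taken. A rough TF2 paraphrase of that style of loss (the linked code itself is TF1; the names here are placeholders, not its exact code):

import tensorflow as tf

def pg_loss(logits, actions, returns):
    # L = mean(G_t * -log pi(a_t | s_t)): minimizing this ascends the
    # REINFORCE objective, with no label construction needed.
    neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)
    return tf.reduce_mean(neg_log_prob * returns)

Writing the loss this way also keeps the learning rate in one place (the optimizer), which sidesteps the double-learning-rate concern raised above.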