matpalm / cartpoleplusplus

3d cartpole gym env using bullet physics trained from pixels with tensorflow LRPG, DDPG & NAF
http://matpalm.com/blog/cartpole_plus_plus/
MIT License

A question on computing q_gradients_wrt_actions #4

Closed: pxlong closed this 7 years ago

pxlong commented 7 years ago

Hi,

I just read through your DDPG implementation, and it looks awesome. Thanks for sharing!

I'm currently confused by the q_gradients_wrt_actions function: why do we add [0] after the returned gradients, given that we use a batch of actions to compute them?

def q_gradients_wrt_actions(self):
    """ gradients for the q.value w.r.t just input_action; used for actor training"""
    return tf.gradients(self.q_value, self.input_action)[0]

Thank you so much.

matpalm commented 7 years ago

it's just a nuance of the tf.gradients API, which always returns a list, even when you're only asking for gradients with respect to a single tensor...

from the docs

gradients(ys, xs) adds ops to the graph to output the partial derivatives of ys with respect to xs. It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys.
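
for example, even with a single xs tensor you get back a one-element list (a minimal sketch, assuming the TF 1.x graph-mode API this repo uses):

    import tensorflow as tf  # TF 1.x graph mode, as used in this repo

    x = tf.placeholder(tf.float32, shape=[None, 2])  # a batch of 2d "actions"
    y = tf.reduce_sum(tf.square(x))                  # scalar built from x

    grads = tf.gradients(y, x)      # xs is one tensor -> list of one tensor
    print(type(grads), len(grads))  # <class 'list'> 1
    print(grads[0].get_shape())     # (?, 2) -- same shape as x, batch intact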

pxlong commented 7 years ago

@matpalm But I suppose we do need a list, since self.input_action (and self.q_value) holds a batch of actions sampled from the replay buffer. Why do we only ask for a single value? Sorry for the bother.

matpalm commented 7 years ago

i think you're confusing two things...

  1. this method always returns a list of tensors, even when xs is a single element.
  2. we usually use the first dimension of a tensor to represent a batch.

i'm calling tf.gradients with a single xs value, input_action, so the return value is a list containing one tensor. the [0] in my code indexes the first element of that returned list; it has nothing to do with the shape of the tensor itself, whose leading dimension is still the batch.
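
to make the two points concrete (a sketch with hypothetical placeholders standing in for the real network inputs, not the actual repo code):

    import tensorflow as tf  # TF 1.x graph mode

    # hypothetical stand-ins for the network's state & action inputs
    input_state = tf.placeholder(tf.float32, [None, 4])
    input_action = tf.placeholder(tf.float32, [None, 1])
    q_value = tf.reduce_sum(input_state, 1, keep_dims=True) * input_action

    # one xs entry -> a list of exactly one tensor, whatever the batch size
    grads = tf.gradients(q_value, input_action)
    print(len(grads))            # 1  (list length follows len(xs))
    print(grads[0].get_shape())  # (?, 1)  per-example grads; batch preserved

    # two xs entries -> a list of two tensors
    both = tf.gradients(q_value, [input_state, input_action])
    print(len(both))             # 2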

have a look at the doc again.