Closed zjplab closed 2 years ago
- The constraint Q_c, is this learnt or specified by a human? From the code it seems this was learnt as part of the critics.
- What does dqda do? I think taking the derivative w.r.t. the action seldom happens. I am wondering what purpose it serves?
Q_c is learned from the constraint reward. There is another Q value, Q, which is learned from the task/utility reward. Both Q_c and Q are learned by TD learning at the same time.

dqda is used to optimize the policy so that the action sampled from the policy maximizes the Q value. Note that we consider continuous actions in the paper. This is similar to the policy optimization step in the SAC paper: you can think of the Q value as fixed when optimizing the policy \pi_\theta. With the reparameterization trick, we take the derivative of Q w.r.t. a, and chain it with the derivative of a w.r.t. \theta.
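To make the chain rule concrete, here is a minimal numpy sketch (not the repo's actual code) with a hypothetical deterministic policy a = theta * s and a hypothetical fixed critic Q(s, a) = -(a - a_star)^2. The dqda term is dQ/da; chaining it with da/dtheta gives the policy gradient dQ/dtheta used for ascent:

```python
import numpy as np

def dqda(a, a_star):
    # Derivative of the hypothetical critic Q = -(a - a_star)**2 w.r.t. the action.
    return -2.0 * (a - a_star)

def policy_gradient_step(theta, s, a_star, lr=0.1):
    a = theta * s                 # deterministic policy: a = theta * s
    grad_q_a = dqda(a, a_star)    # dQ/da, supplied by the (fixed) critic
    grad_a_theta = s              # da/dtheta, from the policy parameterization
    # Chain rule: dQ/dtheta = dQ/da * da/dtheta; gradient ascent to maximize Q.
    return theta + lr * grad_q_a * grad_a_theta

theta, s, a_star = 0.0, 1.0, 3.0
for _ in range(200):
    theta = policy_gradient_step(theta, s, a_star)
# After training, the policy's action theta * s approaches a_star,
# the action that maximizes the critic's Q value.
```

In an actual actor-critic implementation the two factors are produced by automatic differentiation through the critic and the (reparameterized) policy network, but the composition is the same product of derivatives shown above.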