hnyu / seditor

Code release for the paper "Towards Safe Reinforcement Learning with a Safety Editor Policy", Yu et al., arXiv 2022

Question? #1

Closed. zjplab closed this issue 2 years ago.

zjplab commented 2 years ago
  1. The constraint critic Q_c: is this learned or specified by a human? From the code it seems it is learned as part of the critics.
  2. What does dqda do? Taking the derivative with respect to the action seems uncommon, so I am wondering what purpose it serves.
hnyu commented 2 years ago
  1. Q_c is learned from the constraint reward. There is another Q value, Q, which is learned from the task/utility reward. Both Q_c and Q are learned by TD learning at the same time (see the sketch at the end of this reply).
  2. dqda is used to optimize the policy so that the actions sampled from the policy maximize the Q value. Note that we consider continuous actions in the paper. This is similar to the policy optimization step in the SAC paper:

[image: the policy optimization objective from the SAC paper]

You can think of the Q value as fixed when optimizing the policy \pi_\theta. With the reparameterization trick, we take the derivative of Q w.r.t. the action a (this is dqda) and chain it with the derivative of a w.r.t. \theta.
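
To make both points concrete, here is a minimal PyTorch-style sketch, not the actual repo code (the network sizes, the `alpha` entropy weight, and the omission of target networks are simplifying assumptions). It trains a utility critic Q and a constraint critic Q_c with TD targets from their respective rewards, and updates the policy by letting autograd chain dQ/da (the role of dqda) with da/d\theta through a reparameterized action:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # illustrative sizes, not the repo's

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

q_utility = mlp(obs_dim + act_dim, 1)     # Q: learned from the task/utility reward
q_constraint = mlp(obs_dim + act_dim, 1)  # Q_c: learned from the constraint reward
policy = mlp(obs_dim, 2 * act_dim)        # outputs mean and log_std of a Gaussian

critic_opt = torch.optim.Adam(
    list(q_utility.parameters()) + list(q_constraint.parameters()), lr=3e-4)
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def sample_action(obs):
    """Reparameterized sample so gradients can flow theta <- a <- Q."""
    mean, log_std = policy(obs).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mean, log_std.exp())
    a = dist.rsample()                      # a = mean + std * eps, differentiable in theta
    # tanh-squashing correction of log_prob omitted for brevity
    return torch.tanh(a), dist.log_prob(a).sum(-1)

def critic_loss(obs, action, r, r_c, next_obs, gamma=0.99):
    """TD learning for both critics at the same time (target networks omitted)."""
    with torch.no_grad():
        next_a, _ = sample_action(next_obs)
        next_in = torch.cat([next_obs, next_a], -1)
        target = r + gamma * q_utility(next_in).squeeze(-1)
        target_c = r_c + gamma * q_constraint(next_in).squeeze(-1)
    sa = torch.cat([obs, action], -1)
    return ((q_utility(sa).squeeze(-1) - target) ** 2 +
            (q_constraint(sa).squeeze(-1) - target_c) ** 2).mean()

def policy_loss(obs, alpha=0.2):
    """SAC-style step: minimizing this maximizes Q; autograd computes dQ/da * da/dtheta."""
    a, log_prob = sample_action(obs)
    q = q_utility(torch.cat([obs, a], -1)).squeeze(-1)
    return (alpha * log_prob - q).mean()    # only policy_opt.step() is taken on this loss
```

Written as a formula, the policy gradient is \partial J / \partial\theta = E[\partial Q / \partial a \cdot \partial a / \partial\theta], where \partial Q / \partial a is exactly the dqda quantity and \partial a / \partial\theta comes from the reparameterized sample.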