hnyu / seditor

Code release for the paper "Towards Safe Reinforcement Learning with a Safety Editor Policy", Yu et al., arXiv 2022

Question? #1

Closed. zjplab closed this issue 2 years ago.

zjplab commented 2 years ago
  1. The constraint critic Q_c: is this learned or specified by a human? From the code it seems it is learned as part of the critics.
  2. What does dqda do? Taking the derivative with respect to the action seems uncommon, so I am wondering what purpose it serves.
hnyu commented 2 years ago
  1. Q_c is learned from the constraint reward. There is another Q value, Q, which is learned from the task/utility reward. Both Q_c and Q are learned by TD learning at the same time (see the sketch at the end of this reply).
  2. dqda is used to optimize the policy so that the actions sampled from the policy maximize the Q value. Note that we consider continuous actions in the paper. This is similar to the policy optimization step in the SAC paper:

[image: the policy optimization objective from the SAC paper]

You can think of the Q value as fixed when optimizing the policy \pi_\theta. With the reparameterization trick, we take the derivative of Q w.r.t. the action a (this is dqda) and chain it with the derivative of a w.r.t. \theta.
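
To make both points concrete, here is a minimal PyTorch-style sketch, not the actual repo code (the network sizes, the `alpha` entropy weight, and the omission of target networks are simplifying assumptions). It trains a utility critic Q and a constraint critic Q_c with TD targets from their respective rewards, and updates the policy by letting autograd chain dQ/da (the role of dqda) with da/d\theta through a reparameterized action:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # illustrative sizes, not the repo's

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

q_utility = mlp(obs_dim + act_dim, 1)     # Q: learned from the task/utility reward
q_constraint = mlp(obs_dim + act_dim, 1)  # Q_c: learned from the constraint reward
policy = mlp(obs_dim, 2 * act_dim)        # outputs mean and log_std of a Gaussian

critic_opt = torch.optim.Adam(
    list(q_utility.parameters()) + list(q_constraint.parameters()), lr=3e-4)
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def sample_action(obs):
    """Reparameterized sample so gradients can flow theta <- a <- Q."""
    mean, log_std = policy(obs).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mean, log_std.exp())
    a = dist.rsample()                      # a = mean + std * eps, differentiable in theta
    # tanh-squashing correction of log_prob omitted for brevity
    return torch.tanh(a), dist.log_prob(a).sum(-1)

def critic_loss(obs, action, r, r_c, next_obs, gamma=0.99):
    """TD learning for both critics at the same time (target networks omitted)."""
    with torch.no_grad():
        next_a, _ = sample_action(next_obs)
        next_in = torch.cat([next_obs, next_a], -1)
        target = r + gamma * q_utility(next_in).squeeze(-1)
        target_c = r_c + gamma * q_constraint(next_in).squeeze(-1)
    sa = torch.cat([obs, action], -1)
    return ((q_utility(sa).squeeze(-1) - target) ** 2 +
            (q_constraint(sa).squeeze(-1) - target_c) ** 2).mean()

def policy_loss(obs, alpha=0.2):
    """SAC-style step: minimizing this maximizes Q; autograd computes dQ/da * da/dtheta."""
    a, log_prob = sample_action(obs)
    q = q_utility(torch.cat([obs, a], -1)).squeeze(-1)
    return (alpha * log_prob - q).mean()    # only policy_opt.step() is taken on this loss
```

Written as a formula, the policy gradient is \partial J / \partial\theta = E[\partial Q / \partial a \cdot \partial a / \partial\theta], where \partial Q / \partial a is exactly the dqda quantity and \partial a / \partial\theta comes from the reparameterized sample.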