mahaitongdae / Feasible-Actor-Critic

Code for the paper "Feasible Actor-Critic: Constrained Reinforcement Learning for Ensuring Statewise Safety".
MIT License

Penalty term in the actor loss #3

Closed by JakobThumm 1 year ago

JakobThumm commented 1 year ago

Hi, in https://github.com/mahaitongdae/Feasible-Actor-Critic/blob/54c20dfddc4d3679ab793baf67f70452207d801b/learners/sac.py#L400 you calculate the cost penalty for the actor loss as $\lambda(s_t) \cdot Q_c(s_t, a_t)$, whereas Eq. 4.2 of your paper defines the cost penalty as $\lambda(s_t) \cdot (Q_c(s_t, a_t) - d)$. Is there a reason you omitted the cost limit $d$ from the actor loss? Do you want to train a policy that incurs as little cost as possible? I noticed that the loss of the $\lambda$ MLP does correctly include the cost limit.

mahaitongdae commented 1 year ago

Hi, thanks for your question! The reason for omitting $\lambda(s_t) \cdot d$ from the actor loss is that it does not depend on the actor network parameters, so it contributes no gradient to them. Including or excluding this term therefore yields exactly the same actor gradient.
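
A minimal sketch of this point (using PyTorch autograd purely for illustration; this is not the repository's actual code, and all names and values below are made up): the term $\lambda(s_t) \cdot d$ is constant with respect to the actor parameters, so both forms of the penalty produce the same gradient.

```python
import torch

# Toy stand-ins: theta plays the role of the actor parameters,
# a = pi_theta(s) is the actor output, q_c is a toy cost critic,
# lam is the state-dependent multiplier lambda(s_t) (no gradient to theta).
theta = torch.tensor([0.5], requires_grad=True)
s = torch.tensor([1.0])

a = theta * s                      # a_t = pi_theta(s_t)
lam = 2.0 * s.detach()             # lambda(s_t), independent of theta
q_c = (a - 0.3) ** 2               # Q_c(s_t, a_t), depends on theta through a
d = 0.1                            # cost limit

loss_with_d = (lam * (q_c - d)).sum()     # penalty as in Eq. 4.2 of the paper
loss_without_d = (lam * q_c).sum()        # penalty as computed in sac.py

grad_with_d = torch.autograd.grad(loss_with_d, theta, retain_graph=True)[0]
grad_without_d = torch.autograd.grad(loss_without_d, theta)[0]

# lam * d does not depend on theta, so both gradients are identical.
print(grad_with_d, grad_without_d)
```
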

JakobThumm commented 1 year ago

Thank you for clarifying!