intelligent-control-lab / guard


Question related to state-constraint satisfaction of SCPO? #6

Closed Haihan-W closed 6 months ago

Haihan-W commented 6 months ago

(The question was posted as image attachments; see the links below.)

https://github.com/intelligent-control-lab/guard/assets/33006435/cd88cb56-0d54-4798-b354-b39f6f568773

https://github.com/intelligent-control-lab/guard/assets/33006435/590d80aa-8c6a-48cd-953b-b230be3d08ee

CaesarAndylaw commented 6 months ago

Hi Haihan, your question is insightful! In our paper, we establish that if, during policy optimization, the right-hand side (RHS) of equation (13) is constrained to be smaller than $w_i$, then the state-wise cost $J_{D_i}$ of the new policy stays below $w_i$ at every training iteration. Crucially, there is no need to prove that the RHS of (13) converges to 0.
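To make the logic explicit (a schematic restatement of the argument above, not the exact form of equation (13) from the paper): the RHS of (13) upper-bounds the new policy's state-wise cost, so keeping that surrogate below $w_i$ keeps the true cost below $w_i$ as well,

$$ J_{D_i}(\pi_{k+1}) \;\le\; \text{RHS of (13)} \;\le\; w_i \qquad \text{for every training iteration } k. $$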

It's important to understand that the computation of the RHS of (13) involves the advantage term $A_D^\pi$. Theoretically, with the ground-truth value of $A_D^\pi$ for the current policy, the constrained policy optimization ensures that the new policy's state-wise cost satisfies $J_{D_i} \leq w_i$.
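As a rough illustration of where $A_D^\pi$ enters in practice, here is a minimal sketch with hypothetical helper names (`cost_surrogate`, `update_is_feasible`) and PyTorch tensors; it is not the SCPO implementation, and the `remainder` argument stands in for whatever divergence/remainder term the actual bound carries:

```python
import torch

def cost_surrogate(cost_adv, logp_new, logp_old):
    """Importance-weighted estimate of the cost-advantage term that enters
    the RHS of the bound (hypothetical helper, for illustration only)."""
    ratio = torch.exp(logp_new - logp_old)
    return (ratio * cost_adv).mean()

def update_is_feasible(J_D_old, cost_adv, logp_new, logp_old, w_i, remainder=0.0):
    """Accept a candidate policy only if the surrogate upper bound on its
    state-wise cost stays below the threshold w_i. `remainder` is a
    placeholder for the divergence/remainder term of the actual bound."""
    bound = J_D_old + cost_surrogate(cost_adv, logp_new, logp_old) + remainder
    return bound.item() <= w_i
```

With exact advantages this check is exactly the constraint "RHS of (13) $\leq w_i$"; with learned advantages it is only as accurate as the cost critic, which is the gap discussed next.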

In the practical implementation, however, the advantage is approximated by a neural network learned from limited experience, which introduces a gap between the learned $A_D^\pi$ and the ground truth. Additionally, we linearize the objective and constraints during policy optimization, casting the update as a Linear Quadratic Constrained Linear Program (LQCLP) (Section 10.2 of CPO: https://arxiv.org/abs/1705.10528), and apply a backtracking line search to balance reward improvement and cost reduction (code lines 433 to 555 in SCPO: https://github.com/intelligent-control-lab/StateWise_Constrained_Policy_Optimization).
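For intuition about the backtracking step, here is a minimal sketch (not the actual code at lines 433 to 555; `surrogate_reward`, `surrogate_cost`, and `kl_divergence` are hypothetical callables supplied by the caller): the proposed step is shrunk until the candidate update does not worsen the linearized reward surrogate, keeps the cost surrogate within its budget, and stays inside the KL trust region.

```python
def backtracking_line_search(theta, full_step, surrogate_reward, surrogate_cost,
                             kl_divergence, cost_budget, target_kl,
                             max_backtracks=10, decay=0.8):
    """Shrink the proposed step until the candidate (i) does not worsen the
    reward surrogate, (ii) keeps the cost surrogate within its budget, and
    (iii) stays inside the KL trust region. Sketch only: theta and full_step
    are flat parameter vectors (e.g. NumPy arrays)."""
    base_reward = surrogate_reward(theta)
    for k in range(max_backtracks):
        candidate = theta + (decay ** k) * full_step
        if (surrogate_reward(candidate) >= base_reward          # no reward regression
                and surrogate_cost(candidate) <= cost_budget    # cost surrogate within budget
                and kl_divergence(theta, candidate) <= target_kl):  # trust region respected
            return candidate  # first acceptable step
    return theta  # reject the update if no acceptable step is found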

In summary:

(i) The guarantee holds at each training iteration, not just after convergence.
(ii) Practical violations are attributed to the inherent learning error of the advantage function and to solving the LQCLP form of equation (11) in the implementation.
(iii) SCPO ensures state-wise cost satisfaction in expectation (spelled out schematically below); we are actively working on a much stronger theoretical result, Absolute SCPO (ASCPO), which guarantees high-probability state-wise cost satisfaction.

You might also find our side project APO interesting; it originates from the theoretical result of ASCPO and is a basic RL algorithm that ensures monotonic improvement of the lower probability bound of performance (https://arxiv.org/abs/2310.13230).
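To spell out point (iii) schematically (the notation is illustrative, with $D_i(\tau)$ denoting the state-wise cost quantity being bounded; exact definitions are in the respective papers): satisfaction in expectation has the form

$$ \mathbb{E}_{\tau \sim \pi}\big[ D_i(\tau) \big] \;\le\; w_i, $$

while the high-probability guarantee targeted by ASCPO has the form

$$ \Pr_{\tau \sim \pi}\big( D_i(\tau) \le w_i \big) \;\ge\; 1 - \delta $$

for a small, user-chosen $\delta$.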

Haihan-W commented 6 months ago

Thank you so much for your response! This is really informative and fully answers my question.