intelligent-control-lab / guard


Question related to state-constraint satisfaction of SCPO? #6

Closed Haihan-W closed 6 months ago

Haihan-W commented 6 months ago

(The question was posted as image attachments; see the links below.)

https://github.com/intelligent-control-lab/guard/assets/33006435/cd88cb56-0d54-4798-b354-b39f6f568773

https://github.com/intelligent-control-lab/guard/assets/33006435/590d80aa-8c6a-48cd-953b-b230be3d08ee

CaesarAndylaw commented 6 months ago

Hi Haihan, your question is insightful! In our paper, we establish that if, during policy optimization, the right-hand side (RHS) of equation (13) is constrained to be smaller than $w_i$, then the state-wise cost $J_{D_i}$ of the new policy stays below $w_i$ at every training iteration. Crucially, there is no need to prove that the RHS of (13) converges to 0.
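To make the logic explicit (a schematic restatement of the argument above, not the exact form of equation (13) from the paper): the RHS of (13) upper-bounds the new policy's state-wise cost, so keeping that surrogate below $w_i$ keeps the true cost below $w_i$ as well,

$$ J_{D_i}(\pi_{k+1}) \;\le\; \text{RHS of (13)} \;\le\; w_i \qquad \text{for every training iteration } k. $$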

It's important to understand that the computation of the RHS of (13) involves the advantage term $A_D^\pi$. Theoretically, with the ground-truth value of $A_D^\pi$ for the current policy, the constrained policy optimization ensures that the new policy's state-wise cost satisfies $J_{D_i} \leq w_i$.
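As a rough illustration of where $A_D^\pi$ enters in practice, here is a minimal sketch with hypothetical helper names (`cost_surrogate`, `update_is_feasible`) and PyTorch tensors; it is not the SCPO implementation, and the `remainder` argument stands in for whatever divergence/remainder term the actual bound carries:

```python
import torch

def cost_surrogate(cost_adv, logp_new, logp_old):
    """Importance-weighted estimate of the cost-advantage term that enters
    the RHS of the bound (hypothetical helper, for illustration only)."""
    ratio = torch.exp(logp_new - logp_old)
    return (ratio * cost_adv).mean()

def update_is_feasible(J_D_old, cost_adv, logp_new, logp_old, w_i, remainder=0.0):
    """Accept a candidate policy only if the surrogate upper bound on its
    state-wise cost stays below the threshold w_i. `remainder` is a
    placeholder for the divergence/remainder term of the actual bound."""
    bound = J_D_old + cost_surrogate(cost_adv, logp_new, logp_old) + remainder
    return bound.item() <= w_i
```

With exact advantages this check is exactly the constraint "RHS of (13) $\leq w_i$"; with learned advantages it is only as accurate as the cost critic, which is the gap discussed next.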

In the practical implementation, however, the advantage is approximated by a neural network learned from limited experience, which introduces a gap between the learned $A_D^\pi$ and the ground truth. Additionally, we linearize the objective and constraints during policy optimization, casting the update as a Linear Quadratic Constrained Linear Program (LQCLP) (Section 10.2 of CPO: https://arxiv.org/abs/1705.10528), and apply a backtracking line search to balance reward improvement and cost reduction (code lines 433 to 555 in SCPO: https://github.com/intelligent-control-lab/StateWise_Constrained_Policy_Optimization).
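For intuition about the backtracking step, here is a minimal sketch (not the actual code at lines 433 to 555; `surrogate_reward`, `surrogate_cost`, and `kl_divergence` are hypothetical callables supplied by the caller): the proposed step is shrunk until the candidate update does not worsen the linearized reward surrogate, keeps the cost surrogate within its budget, and stays inside the KL trust region.

```python
def backtracking_line_search(theta, full_step, surrogate_reward, surrogate_cost,
                             kl_divergence, cost_budget, target_kl,
                             max_backtracks=10, decay=0.8):
    """Shrink the proposed step until the candidate (i) does not worsen the
    reward surrogate, (ii) keeps the cost surrogate within its budget, and
    (iii) stays inside the KL trust region. Sketch only: theta and full_step
    are flat parameter vectors (e.g. NumPy arrays)."""
    base_reward = surrogate_reward(theta)
    for k in range(max_backtracks):
        candidate = theta + (decay ** k) * full_step
        if (surrogate_reward(candidate) >= base_reward          # no reward regression
                and surrogate_cost(candidate) <= cost_budget    # cost surrogate within budget
                and kl_divergence(theta, candidate) <= target_kl):  # trust region respected
            return candidate  # first acceptable step
    return theta  # reject the update if no acceptable step is found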

In summary:

(i) The guarantee holds at each training iteration, not just after convergence.
(ii) Practical violations are attributed to the inherent learning error of the advantage function and to solving the LQCLP form of equation (11) in the implementation.
(iii) SCPO ensures state-wise cost satisfaction in expectation (spelled out schematically below); we are actively working on a much stronger theoretical result, Absolute SCPO (ASCPO), which guarantees high-probability state-wise cost satisfaction.

You might also find our side project APO interesting; it originates from the theoretical result of ASCPO and is a basic RL algorithm that ensures monotonic improvement of the lower probability bound of performance (https://arxiv.org/abs/2310.13230).
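To spell out point (iii) schematically (the notation is illustrative, with $D_i(\tau)$ denoting the state-wise cost quantity being bounded; exact definitions are in the respective papers): satisfaction in expectation has the form

$$ \mathbb{E}_{\tau \sim \pi}\big[ D_i(\tau) \big] \;\le\; w_i, $$

while the high-probability guarantee targeted by ASCPO has the form

$$ \Pr_{\tau \sim \pi}\big( D_i(\tau) \le w_i \big) \;\ge\; 1 - \delta $$

for a small, user-chosen $\delta$.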

Haihan-W commented 6 months ago

Thank you so much for your response! This is really informative and fully answers my question.