Closed: Haihan-W closed this issue 6 months ago
Hi Haihan, we suspect the difference is attributable to the upgrade from mujoco-py to Mujoco3. The original experiments in the SCPO paper ran on mujoco-py, which has since been deprecated. We are currently rerunning the Pillar task experiments in the new GUARD. We will keep you posted on the results after our initial testing.
In the meantime, you may want to continue the experiments by tuning target_cost over a wider range of values. This target_cost represents the scale of the RHS of (13) (https://arxiv.org/pdf/2306.12594.pdf) on different tasks. Try values such as -0.05, -0.1, -0.2, -0.5.
Thank you Weiye for your quick response!
I will give it a try with different target_cost values, and I am also looking forward to the results of your testing.
Just a clarification, as I am not sure if I am understanding it correctly: shouldn't target_cost be a non-negative number? In that case, should I try 0.05, 0.1, 0.2, 0.5 instead of -0.05, -0.1, -0.2, -0.5 as you suggested?
Because from my understanding, a zero target_cost ($w_i$) means no state constraint violation is allowed, which is the default option used in the SCPO paper (https://arxiv.org/pdf/2306.12594.pdf), and positive target_cost means some violation of state constraints is allowed (relaxed condition). Then what does negative target_cost (e.g. -0.05, -0.1) mean?
Thank you!
Hi Haihan, sorry for the late response. (1) The constraints of the original optimization problem are $E[KL]\leq\delta$ and $J_{D_i}+E[A_D+2(H+1)\epsilon\sqrt{0.5KL}]\leq w_i$. We simplify the implementation of the latter as $J_{D_i} + \alpha \leq w_i$, where $\alpha$ is a constant and $w_i$ is the target cost. Thus we can set the target cost to $w_i - \alpha$ to implement the theory, which means the actual target cost is still a non-negative number (set to zero to align with the other algorithms). For example, if you set the target cost to -0.03, you are actually setting $\alpha$, the approximation of $E[A_D+2(H+1)\epsilon\sqrt{0.5KL}]$, to 0.03.
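To make the shift concrete, here is a minimal sketch of the arithmetic described above. The function and variable names are illustrative, not taken from the GUARD codebase:

```python
# Illustrative sketch of the target-cost shift; names are hypothetical,
# not the actual GUARD implementation.

ALPHA = 0.03  # constant approximating E[A_D + 2(H+1) * eps * sqrt(0.5 * KL)]

def shifted_target_cost(w_i, alpha=ALPHA):
    """Value to pass as --target_cost so that the implemented
    constraint J_D + alpha <= w_i matches the theory."""
    return w_i - alpha

def constraint_value(ep_max_cost, target_cost):
    """c = J_D - (w_i - alpha); the solver enforces c <= 0."""
    return ep_max_cost - target_cost

# With the theoretical target w_i = 0 (no violation allowed),
# the command-line target_cost becomes -0.03:
print(shifted_target_cost(0.0))  # -0.03
```

So a negative command-line value does not relax the constraint; it absorbs the constant $\alpha$ into the target.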
(2.1) One important thing to mention: SCPO experiments should be run in noconti environments. The difference lies in what happens after a successful mission. With noconti, the whole environment is reset after the goal is reached (the positions of both the robot and the goal are randomly re-sampled). Without it, only the goal position is reset, while the robot's state remains unchanged until the run reaches the maximal episode size and the environment is reset as a whole. The reason for setting noconti for SCPO is that its theory considers a complete single-task run to compute the maximum value. If the next task is run without resetting the environment, the maximum cost value carried over from the previous run affects that task, causing the theory to fail and the algorithm to fail to produce reasonable results.
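The two reset behaviors could be sketched roughly as follows. This is hypothetical pseudocode to illustrate the distinction, not the actual engine.py logic:

```python
# Rough sketch of the two reset behaviors described above;
# names are illustrative, not the actual GUARD engine code.
import random

def on_goal_reached(env, continue_goal):
    if continue_goal:
        # default ("conti") behavior: keep the robot where it is,
        # only sample a new goal position
        env["goal_pos"] = random.random()
    else:
        # "noconti" behavior: mark the episode done; the training loop
        # then calls env.reset(), re-sampling BOTH robot and goal
        env["done"] = True

env = {"robot_pos": 0.2, "goal_pos": 0.8, "done": False}
on_goal_reached(env, continue_goal=False)
print(env["done"])  # True; robot_pos is untouched until the full reset
```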
In short, your terminal commands should be changed to:
python cpo.py --task Goal_Point_8Hazards_noconti --seed 1 --model_save
python scpo.py --task Goal_Point_8Hazards_noconti --seed 1 --model_save
python cpo.py --task Goal_Point_8Ghosts_noconti --seed 1 --model_save
python scpo.py --task Goal_Point_8Ghosts_noconti --seed 1 --model_save
python cpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save
python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save
(2.2) The change of environment engine indeed causes some unexpected results in the performance of the pillar tasks. We tested the task Goal_Point_8Pillars_noconti and got the following results:
The target cost here is set to -0.03. We can see that SCPO outperforms CPO, although there is still something wrong with the Mujoco3 environment. We will further debug the new Mujoco3 environments and update them as soon as possible.
Hi Feihan,
Thank you so much for your clarification!
Regarding (1): I compared the code with your clarification, and found that, in the code, target_cost is used to compute "c". And c is defined as c=EpMaxCost - target_cost. target_cost is from the user input when running the code. And if I interpret it correctly, EpMaxCost corresponds to J_D term in the 2nd constraint of Eqn 11 in the paper.
Also in the paper, it defined term "c" = J_D+2(H+1)ϵ*sqrt(0.5KL) - wi, and used "c" in the 2nd constraint of convex programming in the pseudo code.
Combining the above, I derived $target\_cost = w_i - 2(H+1)\epsilon\sqrt{0.5KL}$ (see below). Therefore, would $\alpha$ be $2(H+1)\epsilon\sqrt{0.5KL}$, i.e., excluding the $A_D$ term that you mentioned previously?
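To spell out my derivation (under my assumption that EpMaxCost corresponds to $J_{D_i}$): equating the code's definition $c = J_{D_i} - target\_cost$ with the paper's definition $c = J_{D_i} + 2(H+1)\epsilon\sqrt{0.5KL} - w_i$ gives

$$J_{D_i} - target\_cost = J_{D_i} + 2(H+1)\epsilon\sqrt{0.5KL} - w_i \;\Rightarrow\; target\_cost = w_i - 2(H+1)\epsilon\sqrt{0.5KL},$$

which would make $\alpha = 2(H+1)\epsilon\sqrt{0.5KL}$, without the $A_D$ term.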
Hi Haihan, sorry for the late response. (1) Constraints of the original optimization problem are $E[KL]\leq\delta$, $J_{D_i}+E[A_D+2(H+1)\epsilon\sqrt{0.5KL}]\leq w_i$. We simplify the implementation of the latter term as $J_{D_i}+\alpha\leq w_i$, where $\alpha$ is a constant, $w_i$ is the target cost. Thus we can set target cost as $w_i-\alpha$ to implement the theory, which means the actual target cost is still a non-negative number (set as zero to align other algorithms). For example, if you set target cost as -0.03, you actually set $\alpha$, the approximation of $E[A_D+2(H+1)\epsilon\sqrt{0.5KL}]$, as 0.03.
===========================
Regarding (2.1):
Thanks for pointing it out! I investigated the code in safe_rl_env_config.py and found that noconti in the task name impacts the config['continue_goal'] value: if noconti is included, config['continue_goal'] is set to False.
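Based on that reading, the task-name handling could be sketched like this. This is a hypothetical reconstruction for illustration, not the actual safe_rl_env_config.py code:

```python
# Illustrative sketch of how a "noconti" suffix in the task name could
# toggle continue_goal; hypothetical, not the actual GUARD config code.

def build_config(task_name):
    config = {"continue_goal": True}  # default: keep robot, resample goal
    if "noconti" in task_name:
        # episode ends when the goal is reached; full reset follows
        config["continue_goal"] = False
    return config

print(build_config("Goal_Point_8Pillars_noconti")["continue_goal"])  # False
print(build_config("Goal_Point_8Pillars")["continue_goal"])          # True
```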
But in the engine.py file, it seems that if continue_goal is True, it updates the layout (with new positions of objects) and builds a new goal, whereas if continue_goal is False (i.e., the noconti case), it simply sets self.done = True. Therefore I am a bit confused, because as mentioned before, in the noconti case both the robot's and the goal's positions should be reset. Does that mean that in the self.done = True case, it will go to the reset() function and automatically reset everything, including the robot's and goal's positions (i.e., a new episode)? Also, what does "object" in update_layout() refer to; is it a constraint type like pillar, ghost, etc.?
Lastly, for the cpo method, may I ask why you suggested also specifying noconti in the task name? It seems the example provided in the GitHub README file didn't use noconti for cpo. (see the last figure below)
=========================== Regarding (2.2) and Weiye's response previously
I used "_noconti" in the task name for training as suggested by (2.2) above, as well as testing on different target_cost values as suggested by Weiye. (terminal commands see below)
But I still found the cost performance different from your results in (2.2) above, in terms of the magnitude of the cost performance and cost rate. (see figures below)
Since the above tests in (2.2) also used Mujoco3 as well as _noconti, I am not sure what the reason behind the difference might be.
Sorry for the long post, and thank you in advance for your help!
python cpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200
(note: default target_cost = 0 for cpo in code)
python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200
(note: default target_cost = -0.03 for scpo in code)
python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200 --target_cost 0
python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200 --target_cost -0.05
python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200 --target_cost -0.1
python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200 --target_cost -0.2
python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200 --target_cost -0.5
Hi Haihan, we suspect the difference is attributable to the upgrade from mujoco-py to Mujoco3. The original experiments in the SCPO paper ran on mujoco-py, which has since been deprecated. We are currently rerunning the Pillar task experiments in the new GUARD. We will keep you posted on the results after our initial testing.
In the meantime, you may want to continue the experiments by tuning target_cost over a wider range of values. This target_cost represents the scale of the RHS of (13) (https://arxiv.org/pdf/2306.12594.pdf) on different tasks. Try values such as -0.05, -0.1, -0.2, -0.5.
Hi Haihan,
Regarding (1): I compared the code with your clarification, and found that, in the code, target_cost is used to compute "c". And c is defined as c=EpMaxCost - target_cost. target_cost is from the user input when running the code. And if I interpret it correctly, EpMaxCost corresponds to J_D term in the 2nd constraint of Eqn 11 in the paper.
Sorry for my mistake, you're right here.
Does that mean in self.done=True case, it will go to reset() function and automatically reset everything, including robot and goal's position (i.e. new episode)? Also what does "object" in update_layout() refer to, is it constraint type like pillar, ghost, etc.?
(1) Yes, when self.done is True, a new episode begins, at which point everything is reset by the env.reset() function, which is called in the main loop of the main file (cpo.py, scpo.py, etc.). (2) I'm not familiar with this part of the code, but as far as I can deduce, the 'layout' should include the constraints as well as some of the objects needed for the task, such as the objects pushed by the robot in Push.
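The main-loop pattern in question could be sketched as follows. This is an illustrative toy, not the exact cpo.py/scpo.py code; the point is that env.reset() runs whenever done is True, so in the noconti case both robot and goal are re-sampled each episode:

```python
# Toy sketch of the episode-reset pattern; illustrative only.

class ToyEnv:
    def __init__(self):
        self.resets = 0

    def reset(self):
        self.resets += 1  # stands in for re-sampling robot AND goal
        return 0.0        # initial observation

    def step(self, action):
        # every step ends the episode in this toy (done=True)
        return 0.0, 0.0, True, {}  # obs, reward, done, info

env = ToyEnv()
obs = env.reset()
for _ in range(3):
    obs, reward, done, info = env.step(0)
    if done:
        obs = env.reset()  # new episode: everything re-initialized
print(env.resets)  # 4 (one initial reset + three episode resets)
```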
Lastly, for cpo method, may I ask why you suggested to also specify the noconti in the task name? It seems in the example provided in the github README file, it didn't use noconti for cpo. (see last figure below)
Because of the comparison with SCPO, the test environments for all algorithms need to be harmonized. Otherwise, when calculating metrics such as EpRet, SCPO averages over episodes, while the other algorithms average over the maximum step size (e.g., max_ep_len = 1000), and multiple episodes may be included within one maximum step size.
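A toy numeric illustration of why the two conventions are not comparable (the numbers are made up):

```python
# Suppose that within one max_ep_len window the agent finishes
# 3 episodes with these returns (hypothetical numbers):
episode_returns = [10.0, 12.0, 8.0]

# Per-episode average (episode-based EpRet, as in SCPO with noconti):
ep_avg = sum(episode_returns) / len(episode_returns)

# Per-window total (what accumulates if the environment is not fully
# reset until max_ep_len is reached):
window_total = sum(episode_returns)

print(ep_avg)        # 10.0
print(window_total)  # 30.0  -- same behavior, very different metric
```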
Since the above tests in (2.2) also used mujoco3 as well as _noconti, I am not sure what might be the reason behind the difference?
I didn't succeed in reproducing your problem; there may be a difference in the configuration of the Pillar environment in safe_rl_env_config.py. My configuration is as follows:
if task == "Goal_Point_8Pillars":
    config = {
        # robot setting
        'robot_base': 'xmls/point.xml',

        # task setting
        'task': 'goal',
        'goal_size': 0.5,

        # observation setting
        'observe_goal_comp': True,  # Observe the goal with a compass
        'observe_pillars': True,    # Observe the vectors from the agent to the pillars

        # constraint setting
        'constrain_pillars': True,     # Constrain the robot from entering pillar areas
        'constrain_indicator': False,  # If True, all costs are either 1 or 0 for a given step; if False, costs are dense

        # lidar setting
        'lidar_num_bins': 16,

        # object setting
        'pillars_num': 8,
        'pillars_size': 0.2,
    }
In addition, thank you for your interest in SCPO! If you are interested in research collaboration, or if you would like to discuss it further, please feel free to contact us at weiyezha@andrew.cmu.edu or li-fh21@mails.tsinghua.edu.cn.
(1).
When I tested the code with the SCPO method on the Goal_Point_8Hazards and Goal_Point_8Pillars tasks, only the "hazard" task showed convergence of the cost performance, not the "pillar" task. (see the red cost-performance curve in the figures below)
However, in the paper https://arxiv.org/pdf/2306.12594.pdf, (a) the experiments shown in Figure 7 and Figure 8 of Appendix D illustrate that the SCPO method's cost performance converges to near zero for both the Point-Hazard-8 and Point-Pillar-8 tasks; (b) regarding hyperparameters, Table 4 of the paper does not differentiate hyperparameter choices between tasks.
Therefore, I would expect the cost performance of my test experiments above to be similar to what is shown in the figures from the paper, at least in trend (convergence), but for "pillar" (red cost-performance curve) it clearly did not converge.
Thus, I am wondering what the reason behind this difference might be?
(2).
And here is a set of terminal commands I used to execute my test experiments:
python cpo.py --task Goal_Point_8Hazards --seed 1 --model_save
python scpo.py --task Goal_Point_8Hazards --seed 1 --model_save
python cpo.py --task Goal_Point_8Ghosts --seed 1 --model_save
python scpo.py --task Goal_Point_8Ghosts --seed 1 --model_save
python cpo.py --task Goal_Point_8Pillars --seed 1 --model_save
python scpo.py --task Goal_Point_8Pillars --seed 1 --model_save