intelligent-control-lab / guard

Episodic Cost Performance does not converge / is very large for "Pillar" constraints using SCPO/CPO #5

Closed Haihan-W closed 6 months ago

Haihan-W commented 6 months ago

(1). [Figure: Cost_Performance plot]

(2). [Figure]

Note: In my experiments, similar results occurred for the "Ghost" tasks (with ghosts set to non-trespassable) as well as the "Pillar" tasks. However, since the paper does not discuss results for the "Ghost" tasks, I cannot compare the paper against my experiments for those.

And here is the set of terminal commands I used to run my test experiments:

    python cpo.py --task Goal_Point_8Hazards --seed 1 --model_save
    python scpo.py --task Goal_Point_8Hazards --seed 1 --model_save

    python cpo.py --task Goal_Point_8Ghosts --seed 1 --model_save
    python scpo.py --task Goal_Point_8Ghosts --seed 1 --model_save

    python cpo.py --task Goal_Point_8Pillars --seed 1 --model_save
    python scpo.py --task Goal_Point_8Pillars --seed 1 --model_save

CaesarAndylaw commented 6 months ago

Hi Haihan, we suspect the difference is attributable to the upgrade from Mujoco-py to Mujoco 3. The original experiments in the SCPO paper ran on Mujoco-py, which has since been deprecated. We are currently re-running the Pillar-task experiments in the new GUARD and will keep you posted on the results after our initial testing.

In the meantime, you may want to continue the experiments by tuning target_cost over a wider range of values. This target_cost represents the scale of the RHS of (13) (https://arxiv.org/pdf/2306.12594.pdf) on different tasks. Try values such as -0.05, -0.1, -0.2, and -0.5.

Haihan-W commented 6 months ago

Thank you Weiye for your quick response!

I will give it a try with different target_cost values, and I am also looking forward to the results of your testing.

Just a clarification, and I am not sure if I am understanding it correctly, but shouldn't the target_cost be a non-negative number? In that case, should I try 0.05, 0.1, 0.2, 0.5 instead of -0.05, -0.1, -0.2, -0.5 as you suggested?

From my understanding, a zero target_cost ($w_i$) means no state constraint violation is allowed, which is the default option used in the SCPO paper (https://arxiv.org/pdf/2306.12594.pdf), and a positive target_cost means some violation of the state constraints is allowed (a relaxed condition). What, then, would a negative target_cost (e.g. -0.05, -0.1) mean?

Thank you!

LiangZhisama commented 6 months ago

Hi Haihan, sorry for the late response. (1) The constraints of the original optimization problem are $E[KL]\leq\delta$ and $J_{D_i}+E[A_D+2(H+1)\epsilon\sqrt{0.5\,KL}]\leq w_i$. We simplify the implementation of the latter as $J_{D_i} + \alpha \leq w_i$, where $\alpha$ is a constant and $w_i$ is the target cost. Thus we can set the target cost to $w_i - \alpha$ to implement the theory, which means the actual target cost is still a non-negative number (set to zero to align with the other algorithms). For example, if you set the target cost to -0.03, you are actually setting $\alpha$, the approximation of $E[A_D+2(H+1)\epsilon\sqrt{0.5\,KL}]$, to 0.03.
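
Spelled out, the simplification amounts to the following (a worked restatement of the above, with $w_i = 0$ as in the default setting):

$$
J_{D_i} + \alpha \le w_i
\quad\Longleftrightarrow\quad
J_{D_i} \le w_i - \alpha = \text{target\_cost},
\qquad
w_i = 0,\ \alpha = 0.03 \;\Rightarrow\; \text{target\_cost} = -0.03 .
$$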

(2.1) One important thing to mention is that SCPO-related experiments should be run in noconti environments. The difference is in how the environment resets after a successful mission: in a noconti environment, the whole environment is reset after the goal is reached (the positions of both the robot and the goal are randomly reset), whereas otherwise only the goal position is reset and the robot's state is carried over until the maximum episode length is reached and the environment is reset as a whole. The reason for setting noconti for SCPO is that its theory considers a complete single task run when computing the maximum value. If the next task is run without resetting the environment, the maximum cost value from the previous run carries over into that task, so the theory no longer holds and the algorithm inherently fails to produce reasonable results.
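
To illustrate the point, here is a small sketch (this is only an illustration, not the GUARD code; the function name and cost numbers are made up):

```python
# Illustrative sketch only (not the GUARD implementation): why SCPO needs a full
# reset between task runs. SCPO bounds a maximum state-wise cost accumulated
# within a single complete run.
def max_state_cost(step_costs):
    """Maximum of the running cumulative cost over one task run."""
    running, peak = 0.0, 0.0
    for c in step_costs:
        running += c
        peak = max(peak, running)
    return peak

task_a = [0.0, 0.25, 0.25]   # hypothetical per-step costs of one task run
task_b = [0.0, 0.0, 0.5]     # hypothetical per-step costs of the next run

# With a full reset (noconti), each run is scored independently:
print(max_state_cost(task_a), max_state_cost(task_b))   # 0.5 0.5

# Without a reset, the runs are effectively concatenated, so the maximum from
# the earlier run leaks into the later one and the bound assumed by the theory
# no longer holds:
print(max_state_cost(task_a + task_b))                   # 1.0
```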

In short, your terminal commands should be changed to:

    python cpo.py --task Goal_Point_8Hazards_noconti --seed 1 --model_save
    python scpo.py --task Goal_Point_8Hazards_noconti --seed 1 --model_save

    python cpo.py --task Goal_Point_8Ghosts_noconti --seed 1 --model_save
    python scpo.py --task Goal_Point_8Ghosts_noconti --seed 1 --model_save

    python cpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save
    python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save

(2.2) The change of environment engine indeed causes some unexpected results in the performance of the Pillar tasks. We tested the task Goal_Point_8Pillars_noconti and got the following results: [three result figures]. The target cost here is set to -0.03. We can see that SCPO outperforms CPO, although there is still something wrong with the Mujoco 3 environment. We will further debug the new Mujoco 3 environments and update them as soon as possible.

Haihan-W commented 6 months ago

Hi Feihan,

Thank you so much for your clarification!

Regarding (1): I compared the code with your clarification and found that, in the code, target_cost is used to compute "c", which is defined as c = EpMaxCost - target_cost, where target_cost comes from the user input when running the code. If I interpret it correctly, EpMaxCost corresponds to the $J_D$ term in the 2nd constraint of Eqn. (11) in the paper.

Also, the paper defines the term $c = J_D + 2(H+1)\epsilon\sqrt{0.5\,KL} - w_i$ and uses this "c" in the 2nd constraint of the convex program in the pseudo code.

Combining the above, I derived $\text{target\_cost} = w_i - 2(H+1)\epsilon\sqrt{0.5\,KL}$ (derivation below). Would $\alpha$ therefore be $2(H+1)\epsilon\sqrt{0.5\,KL}$, i.e. excluding the $A_D$ term that you mentioned previously?
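
Explicitly (assuming EpMaxCost corresponds to $J_D$, as noted above):

$$
\begin{aligned}
c &= \text{EpMaxCost} - \text{target\_cost} \;\approx\; J_D - \text{target\_cost},\\
c &= J_D + 2(H+1)\epsilon\sqrt{0.5\,KL} - w_i\\
&\Rightarrow\; \text{target\_cost} = w_i - 2(H+1)\epsilon\sqrt{0.5\,KL}.
\end{aligned}
$$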

> Hi Haihan, sorry for the late response. (1) The constraints of the original optimization problem are $E[KL]\leq\delta$ and $J_{D_i}+E[A_D+2(H+1)\epsilon\sqrt{0.5\,KL}]\leq w_i$. We simplify the implementation of the latter as $J_{D_i} + \alpha \leq w_i$, where $\alpha$ is a constant and $w_i$ is the target cost. Thus we can set the target cost to $w_i - \alpha$ to implement the theory, which means the actual target cost is still a non-negative number (set to zero to align with the other algorithms). For example, if you set the target cost to -0.03, you are actually setting $\alpha$, the approximation of $E[A_D+2(H+1)\epsilon\sqrt{0.5\,KL}]$, to 0.03.

===========================
Regarding (2.1): Thanks for pointing it out! I investigated the code in safe_rl_env_config.py and found that specifying noconti in the task name affects the config['continue_goal'] value: if noconti is included, config['continue_goal'] is set to False.
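
Roughly, the behavior I see is equivalent to something like the following (a simplified sketch, not the exact code in safe_rl_env_config.py):

```python
# Simplified sketch (not the exact safe_rl_env_config.py code) of how I
# understand the task-name suffix to work: 'noconti' flips continue_goal.
def build_config(task_name):
    config = {'continue_goal': True}      # default: only the goal respawns after success
    if 'noconti' in task_name:
        config['continue_goal'] = False   # reset the whole environment after success
    return config

print(build_config('Goal_Point_8Pillars'))          # {'continue_goal': True}
print(build_config('Goal_Point_8Pillars_noconti'))  # {'continue_goal': False}
```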

But in the engine.py file, it seems that if continue_goal is True, the layout is updated (with new positions of the objects) and a new goal is built, whereas if continue_goal is False (i.e. the noconti case), it simply sets self.done = True. I am therefore a bit confused, because, as mentioned before, in the noconti case both the robot and the goal positions should be reset. Does that mean that in the self.done = True case the reset() function is called and automatically resets everything, including the robot's and the goal's positions (i.e. a new episode)? Also, what does "object" in update_layout() refer to; is it the constraint type, like pillar, ghost, etc.?

Lastly, for the CPO method, may I ask why you suggested also specifying noconti in the task name? The example provided in the GitHub README file does not use noconti for CPO.

===========================
Regarding (2.2) and Weiye's earlier response:

I used "_noconti" in the task name for training as suggested by (2.2) above, as well as testing on different target_cost values as suggested by Weiye. (terminal commands see below)

But I still find the cost performance different from your results in (2.2) above, in terms of the magnitude of both the cost performance and the cost rate (see figures below).

Since the tests in (2.2) also used Mujoco 3 as well as _noconti, I am not sure what the reason behind the difference might be.

    python cpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200    # note: default target_cost = 0 for cpo in code
    python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200   # note: default target_cost = -0.03 for scpo in code
    python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200 --target_cost 0
    python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200 --target_cost -0.05
    python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200 --target_cost -0.1
    python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200 --target_cost -0.2
    python scpo.py --task Goal_Point_8Pillars_noconti --seed 1 --model_save --epochs 200 --target_cost -0.5

[Figures: Cost_Rate_Performance and related plots]

> Hi Haihan, we suspect the difference is attributable to the upgrade from Mujoco-py to Mujoco 3. The original experiments in the SCPO paper ran on Mujoco-py, which has since been deprecated. We are currently re-running the Pillar-task experiments in the new GUARD and will keep you posted on the results after our initial testing.
>
> In the meantime, you may want to continue the experiments by tuning target_cost over a wider range of values. This target_cost represents the scale of the RHS of (13) (https://arxiv.org/pdf/2306.12594.pdf) on different tasks. Try values such as -0.05, -0.1, -0.2, and -0.5.

Sorry for the long post! And thank you in advance for your help!

LiangZhisama commented 6 months ago

Hi, Haihan,

> Regarding (1): I compared the code with your clarification and found that, in the code, target_cost is used to compute "c", which is defined as c = EpMaxCost - target_cost, where target_cost comes from the user input when running the code. If I interpret it correctly, EpMaxCost corresponds to the $J_D$ term in the 2nd constraint of Eqn. (11) in the paper.

Sorry for my mistake, you're right here.

> Does that mean that in the self.done = True case the reset() function is called and automatically resets everything, including the robot's and the goal's positions (i.e. a new episode)? Also, what does "object" in update_layout() refer to; is it the constraint type, like pillar, ghost, etc.?

(1) Yes, when self.done is True a new episode begins, at which point everything is reset by the env.reset() function, which is called in the main loop of the main file (cpo.py, scpo.py, etc.). (2) I'm not familiar with this part of the code, but as far as I can tell, the 'layout' should include the constraints as well as some of the objects needed for the task, such as the objects pushed by the robot in Push.
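
The pattern in the main loop is roughly the following (a simplified sketch, not the exact cpo.py/scpo.py code; the function and variable names here are illustrative):

```python
# Simplified sketch (not the exact cpo.py/scpo.py code) of the reset pattern
# described above: env.reset() starts a fresh episode whenever done is True
# or the episode hits the maximum length.
def run_training_loop(env, policy, total_steps, max_ep_len=1000):
    o = env.reset()                       # full reset: robot, goal, and layout are re-sampled
    ep_len = 0
    for _ in range(total_steps):
        a = policy(o)                     # placeholder for the agent's action
        o, r, done, info = env.step(a)
        ep_len += 1
        if done or ep_len == max_ep_len:  # in noconti tasks, reaching the goal also sets done
            o = env.reset()               # robot and goal positions are reset here
            ep_len = 0
```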

> Lastly, for the CPO method, may I ask why you suggested also specifying noconti in the task name? The example provided in the GitHub README file does not use noconti for CPO.

Because we compare against SCPO, the test environments for all algorithms need to be harmonized. Otherwise, when calculating metrics such as EpRet, SCPO averages over episodes, while the other algorithms average over the maximum step size (e.g., max_ep_len = 1000), and multiple episodes may be included within one maximum step size.
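
To make the difference concrete, here is a small illustrative example (not the GUARD logging code; the numbers are made up):

```python
# Illustrative example only (not the GUARD logging code): the same data yields
# different "EpRet"-style numbers under the two conventions.
episode_returns = [12.0, 8.0, 10.0]   # three episodes finished within one 1000-step window (made-up numbers)

# Per-episode average (noconti / SCPO-style reporting):
ep_ret_per_episode = sum(episode_returns) / len(episode_returns)   # 10.0

# Fixed-horizon accumulation (non-noconti runs): the episodes inside the
# max_ep_len window are lumped together, so the reported value is larger.
ep_ret_per_window = sum(episode_returns)                            # 30.0

print(ep_ret_per_episode, ep_ret_per_window)
```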

> Since the tests in (2.2) also used Mujoco 3 as well as _noconti, I am not sure what the reason behind the difference might be.

I wasn't able to reproduce your problem; there may be a difference in the configuration of the Pillar environment in safe_rl_env_config.py. My configuration is as follows:

    if task == "Goal_Point_8Pillars":
        config = {
            # robot setting
            'robot_base': 'xmls/point.xml',

            # task setting
            'task': 'goal',
            'goal_size': 0.5,

            # observation setting
            'observe_goal_comp': True,    # Observe the goal with a compass-style observation
            'observe_pillars': True,      # Observe the vector from agent to pillars

            # constraint setting
            'constrain_pillars': True,    # Constrain robot from contacting the pillars
            'constrain_indicator': False, # If true, all costs are either 1 or 0 for a given step. If false, then we get dense cost.

            # lidar setting
            'lidar_num_bins': 16,

            # object setting
            'pillars_num': 8,
            'pillars_size': 0.2,
        }
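
As a side note, for the _noconti variant of this task (Goal_Point_8Pillars_noconti) the same configuration would additionally have config['continue_goal'] set to False, as discussed above.
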
LiangZhisama commented 6 months ago

In addition, thank you for your interest in SCPO! If you are interested in a research collaboration, or would like to discuss this further, please feel free to contact us at weiyezha@andrew.cmu.edu or li-fh21@mails.tsinghua.edu.cn.