liuzuxin / cvpo-safe-rl

Code for "Constrained Variational Policy Optimization for Safe Reinforcement Learning" (ICML 2022)
GNU General Public License v3.0

CVPO for Speed-constrained MuJoCo #10

Closed qianlin04 closed 1 year ago

qianlin04 commented 1 year ago

CVPO is an impressive work. Have you tried applying CVPO to the Speed-constrained MuJoCo environments? In the Speed-constrained MuJoCo setting, the constraint is on the accumulated velocity, and both the threshold and the accumulated cost are discounted.

[image: mujoco]

I tried running CVPO directly on Speed-constrained MuJoCo, but the results were not as expected. In particular, the return decreased during training. Additionally, even under the unlimited setting, CVPO still performed poorly. Could you provide some guidance? Did I miss some key hyperparameters or run into a bug?

[image: cvpo_mujoco]

liuzuxin commented 1 year ago

Hi @dolts4444, thank you for your interest in our work, and sorry about the late reply. I have been preparing a new version of the CVPO implementation, along with some other safe RL baselines, and ran the velocity experiments in MuJoCo that you mentioned, which can hopefully address your issue. I didn't try this repo on the original MuJoCo tasks, and it is not surprising that it does not work well there, because the hyper-parameters are not optimized for those long-horizon environments.

I tried my new version of CVPO on the SafetyGymnasium velocity tasks, which I think are almost identical to the environments you are using, and the results are pretty good with the default hyper-parameters. I attached some results here:

I haven't finished tuning the hyperparameters for all algorithms on the safety-gymnasium tasks; once that is done, I will upload the results.

Regarding the failures of CVPO in this repo, I conjecture that it is due to poor Q estimation in these long-horizon tasks. Note that CVPO is a pure off-policy algorithm, so its performance depends heavily on the quality of both the reward and cost Q estimations. If you check the Q-value logs of your training, I suspect the values may have exploded; I ran into this issue in my previous experiments as well. There are several potential reasons:

1. The reward/cost are not normalized properly and the discount factor is large, so the Bellman update can hardly converge.
2. The polyak update coefficient is not set properly, which also keeps the Bellman update from converging.
3. The replay buffer is small and the critics overfit to the most recent data, which can also cause exploded values.

In my new repo, using n-step returns (such as n=3) and a smaller discount factor (such as 0.98) mitigates this issue; a rough sketch of the n-step target is below.
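A minimal sketch of the kind of n-step target I mean, assuming the replay buffer returns n contiguous steps per sample (the function and argument names are illustrative, not this repo's exact API):

```python
import torch

def nstep_q_target(rewards, dones, next_obs, policy, q_target, gamma=0.98, n=3):
    """rewards, dones: [B, n] float tensors (dones are 0/1); next_obs: observation after n steps."""
    with torch.no_grad():
        target = torch.zeros(rewards.shape[0])
        alive = torch.ones(rewards.shape[0])
        for k in range(n):
            # Accumulate discounted rewards, stopping after a terminal step.
            target = target + (gamma ** k) * alive * rewards[:, k]
            alive = alive * (1.0 - dones[:, k])
        # Bootstrap from the (polyak-averaged) target critic n steps ahead.
        next_act = policy(next_obs).sample()
        target = target + (gamma ** n) * alive * q_target(next_obs, next_act).squeeze(-1)
    return target
```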

For other hyper-parameters, such as the KL divergence bounds in the E-step and M-step, I found that CVPO is actually pretty robust, as long as they are on a suitable scale. Setting them relatively small (such as E-step KL 0.005 and M-step KL 0.002) works well for most tasks. If the training is still unstable, try setting them even smaller (similar to tightening the trust region in PPO/TRPO).
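For concreteness, a hypothetical config fragment with these scales (the key names are illustrative and not necessarily the repo's exact config keys):

```python
# Illustrative KL scales only; shrink them further if training is unstable.
kl_cfg = dict(
    estep_kl=0.005,  # E-step KL bound on the non-parametric variational distribution
    mstep_kl=0.002,  # M-step KL trust region for the parametric policy update
)
```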

Admittedly, CVPO has its limitations: it relies heavily on good Q estimations rather than on-policy-style cost estimations, and it may not scale well to high-dimensional action spaces because of the sampling procedure over the continuous action space. I plan to summarize these hyperparameters' effects, the limitations, and some potential future work in my new repo.

The new repo is here. Since you asked this question, I released the package earlier than I planned, so the documentation is not complete yet, but I will try to update it ASAP. Feel free to try it on your tasks, and please do not hesitate to contact me if you run into any issues or have any thoughts. Thanks.

qianlin04 commented 1 year ago

Thanks for your reply. I am going to try the new implementation of CVPO. In the meantime, I would like to share my attempts and thoughts on improving the original version of CVPO on the MuJoCo tasks. I tried to tune the following parameters:

Unfortunately, these parameter tunings did not prevent the decrease in return for HalfCheetah, Ant, and Humanoid. I am puzzled as to why the return only dropped in HalfCheetah, Ant, and Humanoid, while Hopper and Walker2d remained stable. I suspect the action space dimension may be the crucial factor.

Additionally, I observed some interesting phenomena:

As for the Q estimation, I found that the estimated values in Ant and Humanoid exploded, as you mentioned. In HalfCheetah, the estimated value did not explode and converged normally, but the return still decreased and CVPO did not work. Besides, I wonder whether the explosion of the value estimates is a common issue for all off-policy safe RL methods, like SAC-Lag, or specific to CVPO. If it is the latter, which component of CVPO is causing the problem?

By the way, I found that directly applying CVPO to Humanoid causes a runtime error where the loss explodes, leading to numerical issues. Shrinking COV_CLAMP_MIN and COV_CLAMP_MAX to -1 and 1, respectively, circumvents this issue. https://github.com/liuzuxin/cvpo-safe-rl/blob/8bc57a91282321838d38fcfc17ba573ffd4ba67c/safe_rl/policy/model/mlp_ac.py#L174
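To illustrate the idea of the change (a sketch only; the actual parameterization in mlp_ac.py may differ):

```python
import torch

# Tighter clamp range that avoided the numerical blow-up for me.
COV_CLAMP_MIN, COV_CLAMP_MAX = -1.0, 1.0

def covariance_scale(raw_cov_logits: torch.Tensor) -> torch.Tensor:
    """Clamp the raw covariance outputs before mapping them to positive scales,
    so an exploding loss cannot push the variance to inf/NaN."""
    clamped = torch.clamp(raw_cov_logits, COV_CLAMP_MIN, COV_CLAMP_MAX)
    return torch.nn.functional.softplus(clamped) + 1e-6
```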

In the end, I hope the above problems have been solved and that the new implementation of CVPO works well for the speed-constrained MuJoCo tasks.

liuzuxin commented 1 year ago

Hi @dolts4444, thank you for sharing your experiment results and thoughts! The above problems have been solved in the new implementation (please use the newly updated config file). I am testing on Humanoid and Ant, and it seems to work, although the training has not finished yet. The sampled action number doesn't need to be very large; 32 for Ant and 64 for Humanoid should work:

SafetyAntVelocity-v1: [image]
SafetyHumanoidVelocity-v1: [image]

I also had failed trials on the Ant and Humanoid tasks (sometimes Walker2d also fails, but HalfCheetah consistently works well), and the main changes were all on the Q-estimation side rather than the EM parameters. For example, using a small buffer size works best for HalfCheetah, Hopper, and Swimmer, but totally fails for the others. Setting the discount factor to 0.98 also works well for most tasks except Humanoid, etc. The exploding Q-function phenomenon is actually quite common for off-policy algorithms; it troubled me a lot previously as well, and I found there is no simple fix other than tuning these Q-related parameters. All the EM-related parameters were kept the same across all Bullet-Safety-Gym and Safety-Gymnasium tasks. Therefore, I think the most critical parameters (and perhaps the bottleneck) are still on the Q-learning side.
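As an example of those Q-side knobs, here is a minimal sketch of the polyak (soft) target update mentioned earlier; the coefficient tau is one of the values worth tuning, and the default shown is only illustrative:

```python
import torch

@torch.no_grad()
def polyak_update(net: torch.nn.Module, target_net: torch.nn.Module, tau: float = 0.005):
    """Soft target update: target <- (1 - tau) * target + tau * online."""
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)
```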

Another interesting thing worth noting is that the unbounded parameter plays a crucial role for certain tasks; for example, only setting it to True works for most Safety-Gymnasium tasks. This parameter controls whether a tanh activation is applied to the final output of the policy network. I haven't figured out exactly why, but I suspect it is related to the action sampling procedure: a bounded sampling policy restricts the agent from exploring actions on the boundary, and the supervised learning in the M-step struggles to backpropagate gradients through the tanh function.
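Roughly, the difference looks like the following (an illustrative sketch, not the repo's exact policy head):

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    def __init__(self, hidden_dim: int, act_dim: int, unbounded: bool):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.unbounded = unbounded

    def forward(self, feat: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mu(feat)
        if not self.unbounded:
            # Bounded in [-1, 1], but saturates near the boundary, which can
            # kill gradients in the M-step's weighted maximum-likelihood loss.
            mean = torch.tanh(mean)
        return torch.distributions.Normal(mean, self.log_std.exp())
```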

You should be able to directly reproduce the results I pasted above by running python examples/mlp/train_cvpo_agent.py --task Safety[xx]VelocityGymnasium-v1 --use_default_cfg 1 for the speed-constrained tasks. Note that you need the most recent Safety-Gymnasium from their GitHub, installed from source. If you want to try it with your own environments, you can simply change the example scripts from import gymnasium as gym to import gym, and modify the environment names here.

I will run more experiments on these hyperparameters and summarize them in the docs when I have more time. Let's keep each other posted on the experiments.

qianlin04 commented 1 year ago

Hi @liuzuxin, thanks for your reply. It seems that the speed-constrained MuJoCo tasks I'm focusing on differ from SafetyGymnasium in the cost function. The tasks I'm interested in constrain the cumulative velocity, similar to FOCOPS and CUP:

$$ \sum_{t=0}^{\infty} \gamma^t v_t < b $$

And the threshold values for all tasks are set to

{ 'Ant-v3': 103.115,
  'HalfCheetah-v3': 151.989,
  'Hopper-v3': 82.748,
  'Humanoid-v3': 20.140,
  'Swimmer-v3': 24.516,
  'Walker2d-v3': 81.886}

On the other hand, the cost in SafetyGymnasium seems to be $c_t = \mathbb{1}(v_t > v_{\max})$, with a threshold of 25 for all tasks. I plan to run some experiments to see whether CVPO is also effective in my setting.
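To make the difference concrete, here is a small illustrative sketch of the two conventions (the thresholds b and v_max are task-specific, and neither function is the exact implementation from either codebase):

```python
import numpy as np

def discounted_velocity_cost(velocities, gamma=0.99):
    """FOCOPS/CUP-style constraint: discounted sum of raw velocities, compared to b."""
    discounts = gamma ** np.arange(len(velocities))
    return float(np.sum(discounts * np.asarray(velocities)))

def indicator_velocity_cost(velocities, v_max):
    """SafetyGymnasium-style cost: 1 at every step where the velocity exceeds v_max."""
    return float(np.sum(np.asarray(velocities) > v_max))
```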

qianlin04 commented 1 year ago

Hi @liuzuxin, I'm delighted to see that all the issues mentioned above have been resolved in the new implementation of CVPO. Under the cumulative-velocity constraint setting, CVPO works well without much parameter tuning in all environments except HalfCheetah. In HalfCheetah, the estimated cost value satisfies the constraint (150), but the cumulative cost exceeds the threshold during evaluation (note that 'test/cost' is discounted):

[image]

Furthermore, there is a significant difference in the return between data collection and policy evaluation:

[image]

This problem did not occur in other environments. I am trying to fix it. Can you provide some suggestions?

liuzuxin commented 1 year ago

Hi @dolts4444, glad to hear it works! I conjecture the difference between the train and eval results is because the default deterministic_eval parameter here is True: during training the agent uses a stochastic policy (perhaps with a large variance to encourage exploration), while during testing the agent uses the mean of the output distribution, which can be regarded as taking the action with the highest probability density.
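In code terms, it is roughly the following (a toy illustration, not the repo's actual code path):

```python
import torch
from torch.distributions import Normal

def select_action(dist, deterministic_eval: bool, training: bool):
    """Sample during training; use the distribution mean at eval time if deterministic_eval."""
    if training or not deterministic_eval:
        return dist.sample()  # stochastic, keeps the exploration noise
    return dist.mean          # highest-density action for a Gaussian

dist = Normal(torch.zeros(3), 0.5 * torch.ones(3))
print(select_action(dist, deterministic_eval=True, training=False))  # tensor([0., 0., 0.])
```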

If you encounter such a dramatic difference between train and test, it is possible that the variance prediction did not converge, because of the small mstep_kl_std I set here. If you check the mstep/entropy metric in wandb, you may find that the policy entropy is large or keeps increasing without ever decreasing. An ideal curve for this metric should increase first and then decrease monotonically, as in the following figures: [image]

The theory behind this can be found in Sec. 4.2.1 of this paper, which is the reason we use decoupled mean and std KL regularizations. Intuitively, the increasing phase encourages exploration, while the decreasing phase means the algorithm has found the optimal action and started to shrink the exploration range.

If you set the mstep_kl_std parameter larger, such as to 0.005 or more, the variance may converge faster and avoid the issue. Another option is to change the conditioned_sigma parameter here to False, which corresponds to a state-independent variance parameter for controlling exploration and is thus easier to converge. Both are worth trying; please let me know if the above comments help.
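For reference, a minimal sketch of the decoupled KL idea for diagonal Gaussian policies (illustrative code, not the repo's exact implementation), where the std term is the one bounded by mstep_kl_std:

```python
import torch
from torch.distributions import Normal, kl_divergence

def decoupled_gaussian_kl(mu_old, std_old, mu_new, std_new):
    """Split KL(old || new) into a mean term and a std term, each bounded separately."""
    # Mean part: how far the new mean moved, measured under the old std.
    kl_mean = kl_divergence(Normal(mu_old, std_old), Normal(mu_new, std_old)).sum(-1).mean()
    # Std part: how much the new std changed, with the mean held at the old value.
    kl_std = kl_divergence(Normal(mu_old, std_old), Normal(mu_old, std_new)).sum(-1).mean()
    return kl_mean, kl_std
```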

qianlin04 commented 1 year ago

Hi @liuzuxin, thanks for your patient explanation. I followed your suggestions, but there are still some issues with the entropy in certain environments. Specifically, in the Humanoid environment the entropy increases to a certain value and then stays there, while in the Ant environment the entropy decreases and even reaches negative values.

liuzuxin commented 1 year ago

Hi @qianlin04, MuJoCo's Humanoid and Ant are indeed pretty challenging for CVPO. Unfortunately, I haven't had time to tune these two environments specifically, but I think the performance should be related to the base algorithm parameters, such as the buffer size, hidden sizes/layers, etc., because these environments are quite high-dimensional (so a larger action sampling number might also help). The parameters might differ significantly from the default ones. By the way, I don't really think these environments are very useful for testing safe RL algorithms, because they mostly test the base algorithm's reward-seeking performance...