liuzuxin / cvpo-safe-rl

Code for "Constrained Variational Policy Optimization for Safe Reinforcement Learning" (ICML 2022)
GNU General Public License v3.0

[Questions] Performance with default DDPG, SAC, TD3, and other confusions #9

Open: Gaiejj opened this issue 1 year ago

Gaiejj commented 1 year ago

As maintainers of OmniSafe, we are currently working on adding CVPO to our supported algorithm list. You can find more information about OmniSafe on our homepage at https://github.com/OmniSafeAI/omnisafe. During our implementation, however, we encountered a few challenges that we hope to resolve with the original author's guidance.

  1. Firstly, we noticed that CVPO sets the random layout to False, as shown in the code at https://github.com/liuzuxin/cvpo-safe-rl/blob/main/envs/safety-gym/safety_gym/envs/engine.py#L118. This means the environment lacks randomness: obstacle and goal positions remain unchanged across epochs. We are curious about the reason behind this design choice and how it aligns with the safety objectives of CVPO (see the config sketch after this list).

  2. Secondly, while the author has implemented the Lagrangian versions of SAC, DDPG, and TD3, we ran into problems when testing them through our modified interface with a truly random layout. Specifically, when we ran SAC, DDPG, and TD3 on the velocity tasks, we consistently obtained a reward of zero, which differs from the results obtained on other platforms such as Tianshou and Stable-Baselines3. This disparity raises concerns about the efficacy of CVPO's Lagrangian baselines, and we would appreciate insights on how to make them work reliably.

  3. Lastly, we are curious why CVPO changed the maximum episode length from Safety Gym's original setting of 1000 to 500 in the experiments. It would be helpful to understand the rationale behind this decision (this setting also appears in the sketch after this list).
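For concreteness, here is a minimal sketch of where both settings live, assuming the standard Safety Gym Engine config dict (the key names follow the upstream engine.py defaults; the specific values below are illustrative, not this repo's exact config):

```python
# Minimal sketch (illustrative): both settings discussed above are plain keys
# in the Safety Gym Engine config dict.
from safety_gym.envs.engine import Engine

config = {
    'robot_base': 'xmls/point.xml',
    'task': 'goal',
    'randomize_layout': True,  # this repo sets it to False (fixed obstacle/goal layout)
    'num_steps': 1000,         # max episode length; the paper's experiments use 500
}
env = Engine(config)
obs = env.reset()
```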

Overall, we seek clarification from the original author to avoid any potential misunderstandings and ensure the successful integration of CVPO into Omnisafe.

liuzuxin commented 1 year ago

Hi @Gaiejj , thanks for your questions and the great work you have done for OmniSafe.

I will combine the answers to Q1 and Q3 as follows:

For Q2, what do you mean by using a true random layout for the velocity tasks? Were those results obtained with implementations similar to the ones in this repo? Note that my implementation mainly follows the Spinning Up repo, and it should work well for most non-safe RL tasks given properly selected hyperparameters. One thing I can imagine causing the disparity is the step_per_sample parameter in the OmniSafe code and the episode_rerun_num parameter in my repo. I have noticed that this parameter greatly influences the training stability of off-policy algorithms: reducing it may improve stability, but at the cost of sample efficiency. Another common failure mode for off-policy algorithms is a bad TD-style estimate of the Q function. Using n-step returns or other methods such as Retrace can improve the stability of Q-function training; without a good estimate of Q, all off-policy optimization will fail. Without more details about your setup, these are the two potential causes I can come up with (a sketch of the n-step target is below).
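For illustration, a minimal sketch of the n-step TD target mentioned above, assuming a standard replay-buffer layout (the function name and tensor shapes are hypothetical, not from this repo):

```python
# Minimal sketch (hypothetical): computing an n-step TD target for an
# off-policy critic, one common fix for unstable Q estimation.
import torch

def n_step_target(rewards, dones, next_q, gamma=0.99, n=5):
    """rewards, dones: (n, batch) tensors; next_q: (batch,) Q at the n-th next state."""
    target = next_q
    # Fold rewards back from step n-1 to 0; a done at step t truncates bootstrapping.
    for t in reversed(range(n)):
        target = rewards[t] + gamma * (1.0 - dones[t]) * target
    return target
```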

Also, I will release new implementations of CVPO, DDPG-Lag, and SAC-Lag based on Tianshou in the coming weeks. The new implementations of CVPO and DDPG/SAC are much faster than the current ones; for example, on the CarCircle task, CVPO converges within 30 minutes and SAC-Lag within 15 minutes. You may refer to the new implementations for better integration into OmniSafe at that time.

Gaiejj commented 1 year ago

Thanks for your early reply, which is certainly insightful. I tried running your implementation of CVPO on SafetyPointGoal1-v0 and SafetyCarGoal1-v0 from https://github.com/OmniSafeAI/safety-gymnasium (a minimal usage sketch is below). I will also implement CVPO in OmniSafe within a few days, then create a Pull Request and invite you to review my code. It would be a great help if you could provide some hints for my implementation!
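For reference, a minimal sketch of instantiating those two tasks, assuming the standard safety-gymnasium API (the seed and random action are illustrative):

```python
# Minimal sketch (illustrative): one rollout step in the safety-gymnasium
# tasks mentioned above. Note the extra `cost` term in the step return.
import safety_gymnasium

env = safety_gymnasium.make('SafetyPointGoal1-v0')  # or 'SafetyCarGoal1-v0'
obs, info = env.reset(seed=0)
action = env.action_space.sample()
obs, reward, cost, terminated, truncated, info = env.step(action)
env.close()
```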

liuzuxin commented 1 year ago

Sure. Looking forward to that!