Closed: greg3566 closed this issue 2 years ago.
You are right, the goal environment is actually called button2 in the code. There is a description of this task in Sec. 4.1 of the paper. The difference between the goal and the button task is whether there are gremlins (dynamic obstacles) in the environment. There were several reasons for doing this when I ran the experiments:

1. The goal task in SafetyGym is essentially doing the same thing as the button task, except that 1) the obstacle types and layout are different, and 2) the goal locations are either fixed or randomly sampled. The original goal task randomly samples goal locations, and it is very hard to control this randomness in the code, which means I would need to evaluate each agent for many episodes to characterize the randomness for a fair comparison. So instead, I just modified the button environment, where the goal locations and the obstacle layouts can easily be fixed (see the small sketch at the end of this comment).

2. Reduce the training time. Also, because of the randomness of the goal locations in the original goal task, it takes longer to train the agent. Since CVPO has an inner convex optimization (currently implemented via SciPy; will change it to PyTorch later), the training speed is actually very slow. So in order to save training time, it is better to fix the randomness of the environment to make the agents converge faster. The comparison is still valid as long as all the agents are evaluated on the same set of tasks.

Please let me know if you are still confused! Thanks.
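As a small sketch of what I mean by fixing the layout randomness: re-seeding the environment with the same value before every reset keeps the goal and obstacle layout identical across episodes (the env id and seed below are placeholders, not the exact ones used in this repo):

```python
import gym
import safety_gym  # noqa: F401  -- registers the Safexp-* environments

ENV_ID = "Safexp-PointButton2-v0"  # placeholder id, not necessarily the one in ENV_LIST
FIXED_SEED = 0                     # placeholder seed value

env = gym.make(ENV_ID)
for episode in range(3):
    # Re-seeding with the same value before every reset makes the layout
    # (goal and obstacle positions) identical across episodes in the old gym API.
    env.seed(FIXED_SEED)
    obs = env.reset()
    done = False
    while not done:
        obs, reward, done, info = env.step(env.action_space.sample())
```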
Hi @liuzuxin,
First of all, impressive work on the implementation and paper! Congrats :)
Second, regarding this part:

> Reduce the training time. Also, because of the randomness of the goal locations in the original goal task, it takes longer to train the agent. Since CVPO has an inner convex optimization (currently implemented via SciPy; will change it to PyTorch later), the training speed is actually very slow. So in order to save training time, it is better to fix the randomness of the environment to make the agents converge faster. The comparison is still valid as long as all the agents are evaluated on the same set of tasks.
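If I understand correctly, the inner solve is a small convex dual problem, something like this toy MPO-style temperature dual (made-up shapes and values, and ignoring the cost constraints, so this is not the exact CVPO objective):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(0)
q_values = rng.normal(size=(64, 16))  # made-up [num_states, num_sampled_actions] Q estimates
epsilon = 0.1                          # hypothetical KL trust-region size

def dual(x):
    # g(eta) = eta * eps + eta * mean_s log mean_a exp(Q(s, a) / eta), convex in eta > 0
    eta = x[0]
    log_mean_exp = logsumexp(q_values / eta, axis=1) - np.log(q_values.shape[1])
    return eta * epsilon + eta * np.mean(log_mean_exp)

res = minimize(dual, x0=np.array([1.0]), method="SLSQP", bounds=[(1e-6, None)])
print("optimal temperature eta:", res.x[0])
```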
If you feel comfortable with it, a JAX implementation could be really easy, as it has many NumPy/SciPy functions built in with XLA compilation support. If you want, I can help you with that :)
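For instance, the same toy dual from above could be written with jax.numpy and jit-compiled end to end; jax.scipy.optimize.minimize (which I believe currently only supports BFGS) can stand in for the SciPy call on small smooth problems, and you get gradients for free. Again, just a sketch with made-up data:

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp
from jax.scipy.optimize import minimize

q_values = jax.random.normal(jax.random.PRNGKey(0), (64, 16))  # same made-up shapes as above
epsilon = 0.1

def dual(log_eta, q):
    # Optimize over log(eta) so eta stays positive without needing bound constraints.
    eta = jnp.exp(log_eta[0])
    log_mean_exp = logsumexp(q / eta, axis=1) - jnp.log(q.shape[1])
    return eta * epsilon + eta * jnp.mean(log_mean_exp)

@jax.jit  # XLA-compiles the entire solve, including the BFGS iterations
def solve_dual(q):
    res = minimize(dual, x0=jnp.zeros(1), args=(q,), method="BFGS")
    return jnp.exp(res.x[0])

print("optimal temperature eta:", solve_dual(q_values))
```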
Hi @yardenas,
Thank you so much! Could you point me to some example links of using JAX to accelerate SciPy functions so that I can take a look? It would be very helpful :)
Thank you for your detailed response.
Which environment/task was used for the ICML 2022 paper "Constrained Variational Policy Optimization for Safe Reinforcement Learning" (Liu et al., 2022)? I am confused because script/goal.py contains a button environment, not a goal environment, in ENV_LIST.