openai / safety-starter-agents

Basic constrained RL agents used in experiments for the "Benchmarking Safe Exploration in Deep Reinforcement Learning" paper.
https://openai.com/blog/safety-gym/
MIT License
383 stars · 113 forks

sac-lagrangian shows poor performance on PointGoal1? #4

Open hari-sikchi opened 4 years ago

hari-sikchi commented 4 years ago

When running the Lagrangian version of SAC I get the following curve for the costs. I tried changing the constraint limit over a range of values and didn't see much benefit:

[figure: lagrangian_sac_pointgoal1 — cost curve]

Am I doing something wrong, or is this expected for off-policy algorithms?

zhihaocheng commented 3 years ago

Have you solved this problem?

flodorner commented 3 years ago

There is probably a reason why the SAC results were omitted from the Safety Gym paper. I am currently trying to understand what the problem might be.

flodorner commented 3 years ago

It is also worth noting that the PPO and TRPO results in the paper are trained for about 100x as long as you seem to have trained, and that they need around 10e6 steps to reach a cost of 25.

flodorner commented 3 years ago

It seems like both SAC and TD3 are quite bad at accurately estimating Q-values for the cost, so the approach doesn't work very well even with a fixed cost penalty (at least in the Safety Gym environments). In a toy environment where this was not a problem, things still did not work well because of the volatility of the Q-estimates, which did not play well with the updates of the Lagrangian multiplier.
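
For concreteness, the coupling I mean looks roughly like the sketch below: the cost critic's estimate feeds both the penalized actor loss and the multiplier update, so any bias or noise in Q_c propagates directly into lambda. This is only an illustrative PyTorch sketch, not the code in this repo (which is TensorFlow-based); names like `penalized_actor_loss`, `q_cost`, and `cost_limit_discounted` are made up for the example.

```python
# Illustrative PyTorch sketch of a Lagrangian off-policy update, not the
# implementation in safety-starter-agents (which is TensorFlow-based).
import torch
import torch.nn.functional as F

# Softplus parameterization keeps the multiplier non-negative.
log_lam = torch.zeros(1, requires_grad=True)
lam_optimizer = torch.optim.Adam([log_lam], lr=1e-3)


def penalized_actor_loss(q_reward, q_cost, log_pi, alpha, lam):
    """SAC-style actor loss with a lambda-weighted cost penalty.
    lam is treated as a constant here (no gradient through it)."""
    return (alpha * log_pi - q_reward + lam.detach() * q_cost).mean()


def update_multiplier(q_cost, cost_limit_discounted):
    """Gradient ascent on lam * (E[Q_c] - limit): lambda grows while the
    estimated discounted cost exceeds the limit. Because E[Q_c] is a
    bootstrapped off-policy estimate, its bias and volatility feed
    straight into lambda, which is the instability described above."""
    lam = F.softplus(log_lam)
    lam_loss = -(lam * (q_cost.detach().mean() - cost_limit_discounted))
    lam_optimizer.zero_grad()
    lam_loss.backward()
    lam_optimizer.step()
```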

hari-sikchi commented 3 years ago

I agree. I have experimented with longer training times as well, but it doesn't seem to work well. In off-policy methods built mostly on Q-learning, the Q-estimates are not true evaluations of the current policy (we only evaluate the current policy for one or two gradient steps rather than to convergence), so I believe off-policy methods will have some issues with this Lagrangian approach.

pengzhenghao commented 3 years ago

Hi @flodorner! Have you tried TD3 with the Lagrangian method? According to your statement it performs poorly, right? I am also investigating this issue. May I ask whether you have found any public implementation of the Lagrangian method for off-policy algorithms like SAC and TD3?

Thanks!

hnyu commented 3 years ago

Hi @pengzhenghao, if you are still looking for a Lagrangian method with off-policy algorithms like SAC, please check out our recent implementation. We achieved even better performance on PointGoal1 and CarGoal1 than the PPO results in the original paper.

[figures: the first is the undiscounted reward return, the second is the negative cost]

The code is at https://github.com/HorizonRobotics/alf/blob/pytorch/alf/examples/sac_lagrw_cargoal1_conf.py

hnyu commented 3 years ago

According to our investigation, the key for an off-policy algorithm like SAC to work in this scenario is to directly adjust lambda according to the rollout rewards (yes, a little hacky and unprincipled, but it worked) and also to use a multi-step TD loss (similar to the official PPO's multi-step GAE). Please see my link to the code above.
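
In rough code terms, the idea of driving the multiplier from rollout statistics rather than critic estimates could look like the hypothetical helper below (illustration only, here using measured episode costs; the actual mechanism is in the ALF config linked above).

```python
# Hypothetical illustration of updating lambda from rollout statistics
# instead of bootstrapped critic estimates; not the ALF implementation.
import numpy as np


def update_lambda_from_rollouts(lam, episode_costs, cost_limit,
                                step_size=0.01, lam_max=100.0):
    """Increase lambda when the measured average episode cost exceeds the
    limit, decrease it otherwise. Monte-Carlo rollout costs sidestep the
    bias/volatility of the learned cost critic in the dual update."""
    violation = np.mean(episode_costs) - cost_limit
    return float(np.clip(lam + step_size * violation, 0.0, lam_max))


# Example: recent episodes exceed a limit of 25, so lambda is nudged up.
lam = update_lambda_from_rollouts(lam=1.0, episode_costs=[40.0, 32.0, 28.0],
                                  cost_limit=25.0)
```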

zhihaocheng commented 3 years ago

Hi @hnyu, my understanding is that you collect on-policy trajectories to update the Lagrangian multiplier, which is then used to penalize the cost. Is my understanding right? Also, you showed a figure indicating that SAC-Lagrangian can work better than PPO-Lagrangian. Have you carried out more experiments in different environments to demonstrate that SAC-Lagrangian consistently outperforms PPO-Lagrangian?

hnyu commented 3 years ago

Yeah, your understanding is correct (the multi-step TD loss is also important). Note that my previous statement was not "SAC-Lagrangian is better than PPO-Lagrangian"; I only said the former is better than the latter on PointGoal1 and CarGoal1 according to my observations. I also did experiments on PointGoal2 and CarGoal2, and SAC was not able to beat the reported PPO results (both methods were bad). I currently have no plans for more experiments.

zhihaocheng commented 3 years ago

@hnyu , many thanks for your detailed explanation.

JasonMa2016 commented 3 years ago

Thanks for the great discussion - I am currently going through the code for sac-lagrangian. Could someone explain the rationale behind line 374 in the sac-lagrangian code https://github.com/openai/safety-starter-agents/blob/4151a283967520ee000f03b3a79bf35262ff3509/safe_rl/sac/sac.py#L374? In particular, I don't understand why you would divide by max_ep_len. In the line below, qc is the discounted sum of future costs according to the cost critic, so shouldn't it just be cost_constraint = cost_lim * (1 - gamma ** max_ep_len) / (1 - gamma)?

Thanks!

yardenas commented 3 years ago

@JasonMa2016 I'm also trying to wrap my head around this. I think it comes from the assumption that the accumulated cost is equal at each step (see here). Hence, the undiscounted finite-horizon return is just r * T. On the other hand, the finite-horizon discounted return (under the same assumption) is r * (1 - gamma ** T) / (1 - gamma), so the scaling factor to make the two equal turns out to be (1 - gamma ** T) / (1 - gamma) / T. Again, not sure, but this is how I explained it to myself.
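
A quick numeric check of this reading, assuming gamma = 0.99 and max_ep_len = 1000 (illustrative values; check your actual config):

```python
# Numeric check of the scaling discussed above; gamma and max_ep_len are
# assumed values for illustration.
gamma, max_ep_len, cost_lim = 0.99, 1000, 25.0

# If the per-step cost were constant, c = cost_lim / max_ep_len gives an
# undiscounted episode cost of exactly cost_lim; its discounted counterpart:
per_step_cost = cost_lim / max_ep_len
cost_constraint = per_step_cost * (1 - gamma ** max_ep_len) / (1 - gamma)
print(cost_constraint)  # ~2.5: the threshold the discounted Q_c is compared to
```

This is the same as computing cost_lim * (1 - gamma ** max_ep_len) / (1 - gamma) and then dividing by max_ep_len, i.e., the extra division asked about above.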

JasonMa2016 commented 3 years ago

@yardenas Thanks for the helpful explanation! So is it correct to say that cost_lim, as an input, should be the cost tolerance at each timestep times the number of timesteps? If so, the derivation makes sense as converting an undiscounted total cost limit into a discounted one that is comparable to the value the cost critic outputs.

yardenas commented 3 years ago

@JasonMa2016 Yes, I think so, although in my experience this scaling does not help too much. Did you try running it with cost_lim == 25?

Gaiejj commented 2 months ago

Hello, this seems to be due to the actor learning rate and the Lagrangian multiplier learning rate. Our experiments found that setting them to very small values significantly improves the performance of off-policy SafeRL algorithms such as sac_lag, td3_lag, etc. More experimental results and hyperparameters can be found in OmniSafe: https://github.com/PKU-Alignment/omnisafe
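
For illustration, the change amounts to something like the following generic PyTorch setup; the networks and learning-rate values are placeholders, not OmniSafe's tuned hyperparameters.

```python
# Generic illustration of the suggestion above: give the actor and the
# Lagrange multiplier much smaller learning rates than the critic.
# All values and network shapes are placeholders, not OmniSafe's settings.
import torch

actor = torch.nn.Linear(60, 2)    # stand-in for the real policy network
critic = torch.nn.Linear(62, 1)   # stand-in for the real cost/reward critics
log_lam = torch.zeros(1, requires_grad=True)

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-5)   # small
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
lambda_opt = torch.optim.Adam([log_lam], lr=1e-5)           # very small
```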
