Closed KohlerHECTOR closed 2 years ago
Good catch! Also interesting to see that A2C does better without multiple gradient steps, whereas PPO benefits from them.
It may be because PPO adds the constraint that the policy must be close to the one used for data collection.
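To make that constraint concrete, here is a minimal sketch (not rlberry or SB3 code; the function name and values are illustrative) of PPO's clipped surrogate. Clipping the probability ratio means there is no extra objective value in pushing the new policy far from the data-collection policy, which is why PPO tolerates repeated gradient steps on the same batch:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # PPO's clipped surrogate: the min() caps the incentive to push the
    # new policy's probability ratio far from 1 (i.e. far from the
    # policy used for data collection).
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

# With a positive advantage, pushing the ratio beyond 1 + eps
# yields no additional objective value:
print(ppo_clip_objective(1.1, 1.0))  # 1.1 (inside the clip range)
print(ppo_clip_objective(1.5, 1.0))  # 1.2 (capped at 1 + eps)
```

A2C's plain policy-gradient loss has no such cap, so repeating the step on the same batch keeps moving the policy in the same direction unchecked.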
@KohlerHECTOR Any updates on this? I've seen on Slack that it might have been a hyperparameter issue?
Yes, I agree with @mmcenta: PPO does not have the same objective function as A2C. I think the poor performance of A2C was due to the repeated gradient steps. Another point is that rlberry's A2C default hyperparameters are not the same as SB3's, but that is not an issue.
https://github.com/rlberry-py/rlberry/blob/8168dfc73a802dc9a6308e16a7d7bdce715d4f17/rlberry/agents/torch/a2c/a2c.py#L246-L273
It seems there is a useless for loop in the main training loop of A2C that is not in the original algorithm. After removing the loop, the performance of the A2C agent matches stable-baselines' A2C. @riccardodv and I think the problem is that this loop induces repeated gradient steps in the same direction for k_epochs steps.
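The diagnosis above can be illustrated with a toy experiment (pure NumPy, not rlberry code; all names and data here are made up). Taking k_epochs plain policy-gradient steps on a single on-policy batch keeps pushing the parameters in the same direction, so the update drifts much further than the single step the original A2C prescribes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
states = rng.standard_normal((32, 4))   # fake rollout features
advantages = rng.standard_normal(32)    # fake advantage estimates

def a2c_like_update(k_epochs, lr=0.1):
    theta = np.zeros(4)
    for _ in range(k_epochs):           # the loop the fix removes (k_epochs = 1)
        p = sigmoid(states @ theta)     # prob of the sampled action
        # gradient of -mean(advantage * log p) w.r.t. theta
        grad = -(advantages * (1.0 - p)) @ states / len(advantages)
        theta -= lr * grad
    return np.linalg.norm(theta)

# Repeated steps on the same batch move the parameters far further
# than the single step A2C is supposed to take:
print(a2c_like_update(1))
print(a2c_like_update(8))
```

Without PPO's clipping to bound the movement, each extra epoch re-applies nearly the same gradient, which is consistent with the mismatch against stable-baselines disappearing once the loop is removed.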