rlberry-py / rlberry

An easy-to-use reinforcement learning library for research and education.
https://rlberry-py.github.io/rlberry
MIT License

A2C fix #160

Closed · KohlerHECTOR closed this 2 years ago

KohlerHECTOR commented 2 years ago

https://github.com/rlberry-py/rlberry/blob/8168dfc73a802dc9a6308e16a7d7bdce715d4f17/rlberry/agents/torch/a2c/a2c.py#L246-L273

It seems there is a useless for loop in the main training loop of A2C that is not part of the original algorithm. After removing the loop, the performance of the rlberry A2C agent matches Stable-Baselines3's A2C. @riccardodv and I think the problem is that this loop induces repeated gradient steps in the same direction for k_epochs steps, as sketched below.
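For context, here is a minimal self-contained PyTorch sketch of the update pattern being discussed (not the actual rlberry code; the network interface, tensor names, and coefficients are placeholders):

```python
import torch


def a2c_single_update(net, optimizer, obs, actions, returns, advantages,
                      vf_coef=0.5, ent_coef=0.01):
    """One gradient step per on-policy rollout, as in vanilla A2C.

    Assumes `net(obs)` returns a torch.distributions object for the policy
    and a state-value estimate; these names are illustrative only.
    """
    dist, values = net(obs)
    log_probs = dist.log_prob(actions)
    policy_loss = -(log_probs * advantages).mean()
    value_loss = (returns - values.squeeze(-1)).pow(2).mean()
    entropy = dist.entropy().mean()

    loss = policy_loss + vf_coef * value_loss - ent_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# The removed pattern was roughly:
#     for _ in range(k_epochs):
#         a2c_single_update(net, optimizer, obs, actions, returns, advantages)
# i.e. k_epochs gradient steps on the same on-policy batch, which pushes the
# parameters repeatedly in (roughly) the same direction instead of taking a
# single step per fresh rollout.
```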

yfletberliac commented 2 years ago

Good catch! Also interesting to see that A2C does better without multiple gradient steps, whereas PPO benefits from them.

mmcenta commented 2 years ago

> Good catch! Also interesting to see that A2C does better without multiple gradient steps, whereas PPO benefits from them.

It may be because PPO adds the constraint that the policy must be close to the one used for data collection.
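A minimal sketch of the two policy objectives (again not the rlberry implementations; tensor names are placeholders) may help make this point concrete:

```python
import torch


def a2c_policy_loss(log_probs, advantages):
    """Plain policy-gradient loss. Nothing keeps the updated policy close to
    the policy that collected the data, so repeating gradient steps on the
    same batch quickly makes that batch off-policy."""
    return -(log_probs * advantages).mean()


def ppo_policy_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate. The probability ratio to the data-collection
    policy is clipped, so several epochs on the same batch remain a sound
    (conservative) update."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```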

@KohlerHECTOR Any updates on this? I've seen on Slack that it might have been a hyperparameter issue?

KohlerHECTOR commented 2 years ago

Yes I agree with @mmcenta , PPO does not have the same objective function as A2C. I think the bad performances of A2C were due to repeated gradient steps. There is another point that is that the rlberry A2C default hyperparameters are not the same as SB3's but it is not an issue.