I actually found the issue - num_learning_epochs is not used as a parameter in the new version, so each batch is only used once per update. If you modify the code to run 5 epochs (like the master-branch PPO), it works fine.
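For illustration, here is a minimal, self-contained sketch of the difference, not the actual rsl_rl code; the names `rollout`, `make_mini_batches`, `ppo_update`, and `policy_gradient_step` are placeholders for whatever the algorithms-branch PPO uses internally:

```python
import random

def make_mini_batches(rollout, num_mini_batches):
    """Shuffle the rollout and split it into mini-batches."""
    indices = list(range(len(rollout)))
    random.shuffle(indices)
    batch_size = len(rollout) // num_mini_batches
    for i in range(num_mini_batches):
        yield [rollout[j] for j in indices[i * batch_size:(i + 1) * batch_size]]

def ppo_update(rollout, num_mini_batches=4, num_learning_epochs=1):
    """Run gradient updates over the collected rollout.

    With num_learning_epochs=1 (the behaviour described above), each transition
    contributes to exactly one gradient step per rollout. Setting it to 5, as in
    the master-branch PPO, reuses the same rollout for five passes.
    """
    for epoch in range(num_learning_epochs):          # the outer epoch loop that is missing
        for batch in make_mini_batches(rollout, num_mini_batches):
            policy_gradient_step(batch)               # one optimizer step per mini-batch

def policy_gradient_step(batch):
    # Placeholder for the actual PPO surrogate-loss backward pass and optimizer step.
    print(f"updating on a mini-batch of {len(batch)} transitions")

# Example: a rollout of 64 dummy transitions, updated for 5 epochs.
ppo_update(list(range(64)), num_mini_batches=4, num_learning_epochs=5)
```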
On that note, is there a particular reason for changing the behaviour to not allow training over multiple epochs?
Hi, first of all, thank you for the algorithm implementations! I've tried migrating from the master branch to the algorithms branch (since it supports more algorithms than just PPO) and testing it on the flat-terrain ANYmal locomotion environment in Isaac Lab.

For some reason there is a huge gap in performance between the master-branch PPO and the algorithms-branch PPO: the latter exhibits a much larger reward variance and does not learn to follow the velocity commands at all within 300 update steps. I use the same environment for both, and the LegacyRunner for the algorithms branch.

I've uploaded two screenshots of the training curves; interestingly, they start off the same but then diverge after several update steps.
Has anyone encountered this issue before?