Closed: vassil-atn closed this issue 9 months ago
Hi @Vassil17
Thanks for bringing this to our attention. Indeed it's a curious finding. My initial hunch is that there is an interaction between num_evals and num_resets_per_eval that more or less changes the random seed of your run. In practice we find that training on the quadruped env has high variance across runs. This is something we need to debug further to make training more stable, and we'll keep an eye out for an underlying bug in the training code. Let's follow up here: https://github.com/google/brax/issues/433
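The seed-interaction hunch can be illustrated with a toy sketch (hypothetical, stdlib-only; this is NOT Brax's actual RNG handling): if per-epoch randomness is derived from a master seed in a way that depends on how many epochs the run is split into, then any parameter that changes the epoch count (such as num_evals) effectively reseeds the whole run.

```python
import hashlib

def epoch_seeds(master_seed: int, num_epochs: int) -> list[int]:
    """Derive one seed per epoch from a master seed.

    Toy stand-in for key-splitting (e.g. jax.random.split); deliberately
    folds num_epochs into the derivation to illustrate the hunch above.
    """
    return [
        int.from_bytes(
            hashlib.sha256(f"{master_seed}:{num_epochs}:{i}".encode()).digest()[:4],
            "big",
        )
        for i in range(num_epochs)
    ]

# Same master seed, different epoch counts -> different seed streams,
# so the two runs see different randomness from the very first epoch.
seeds_2 = epoch_seeds(0, 2)
seeds_20 = epoch_seeds(0, 20)
print(seeds_2[0] == seeds_20[0])
```

Under this (assumed) derivation scheme, two runs that differ only in num_evals are effectively two different random seeds, which matches the high run-to-run variance observed on the quadruped env.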
Hi,
I'm trying to get more familiar with MJX for quadruped RL training, so I've been going through the published tutorial.ipynb. I've noticed that the plotted results depend on the num_evals parameter passed to the PPO algorithm, which I assume controls how many times the policy is evaluated over the whole training run, i.e. num_evals=10 for 10k steps of training means a policy evaluation every 1k steps. What I don't understand is why modifying this parameter completely changes the policy. For example:
These are the results for the quadruped environment with num_evals=2:
And these are the results for the same environment with num_evals=20:
From what I can see in the code, the number of steps per training epoch depends on num_evals, which changes the training behaviour. Is there a way to work around this? I'd like to evaluate the policy more often (rather than twice over the whole training cycle) so I can better see how the reward changes.
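The budget split described above can be sketched as plain arithmetic (hypothetical helper name; my understanding is that Brax's PPO trainer runs an initial evaluation and then divides the budget across roughly num_evals - 1 training epochs, additionally rounding each epoch up to a multiple of batch_size * unroll_length * num_minibatches * action_repeat, which the sketch omits):

```python
import math

def env_steps_per_epoch(num_timesteps: int, num_evals: int) -> int:
    """Approximate environment steps between evaluations.

    One evaluation happens up front, so the remaining budget is split
    across num_evals - 1 training epochs (minimum one epoch). This is an
    illustrative simplification, not Brax's exact computation.
    """
    epochs = max(num_evals - 1, 1)
    return math.ceil(num_timesteps / epochs)

# The example from the question: a 10k-step budget.
print(env_steps_per_epoch(10_000, 10))  # ~1.1k steps between evals
print(env_steps_per_epoch(10_000, 2))   # one single 10k-step epoch
```

This is why changing num_evals changes more than just the plot resolution: it changes the epoch length, and with it how often per-epoch machinery (resets, RNG key consumption) runs, so the two runs diverge.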