google-deepmind / mujoco

Multi-Joint dynamics with Contact. A general purpose physics simulator.
https://mujoco.org
Apache License 2.0

Does changing num_evals alter the training process? #1259

Closed: vassil-atn closed this issue 9 months ago

vassil-atn commented 9 months ago

Hi,

I'm trying to get more familiar with MJX so I can use it for quadruped RL training, and I've been going through the published tutorial.ipynb. I've noticed that the plotted results depend on the num_evals parameter given to the PPO algorithm, which I assume means how many times the policy is evaluated over the whole training run, i.e. num_evals=10 for 10k steps of training means a policy evaluation every 1k steps. What I don't understand is why modifying this parameter completely changes the policy. For example:

These are the results for the quadruped environment with num_evals=2: [reward plot]

And these are the results for the same environment with num_evals=20: [reward plot]

From what I can see in the code, the number of steps per training epoch depends on num_evals, which changes the behaviour. Is there a way to work around this? I'd like to evaluate the policy more often than twice per training run, so I can better see how the reward changes.
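
For reference, here's a rough sketch of how I read that bookkeeping (the hyperparameter values are made up for illustration, and the variable names are mine, not necessarily the exact Brax source): num_evals determines how many training steps, and hence environment steps, fit into each epoch between evaluations.

```python
import math

# Rough sketch of the per-epoch bookkeeping as I understand it
# (illustrative names and placeholder hyperparameters, not the
# exact Brax source): the environment steps collected per training
# step are fixed by the batch settings, and num_evals then decides
# how many training steps fit between consecutive evaluations.
num_timesteps = 10_000_000
batch_size = 1024
num_minibatches = 8
unroll_length = 10
action_repeat = 1

env_steps_per_training_step = (
    batch_size * unroll_length * num_minibatches * action_repeat
)

for num_evals in (2, 20):
    evals_after_init = max(num_evals - 1, 1)
    training_steps_per_epoch = math.ceil(
        num_timesteps / (evals_after_init * env_steps_per_training_step)
    )
    print(f"num_evals={num_evals}: "
          f"{training_steps_per_epoch} training steps per epoch")
```

With these placeholder numbers, num_evals=2 packs far more training steps into each epoch than num_evals=20, so the schedule between evaluations is quite different even though the total number of timesteps is the same.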

btaba commented 9 months ago

Hi @Vassil17

Thanks for bringing this to our attention. It's a curious finding indeed. My initial hunch is that there is an interaction between num_evals and num_resets_per_eval that more or less changes the random seed of your run. In practice we find that training on the quadruped env has high variance across runs. This is something we need to debug further to make training more stable, and we'll keep an eye out for an underlying bug in the training code. Let's follow up here: https://github.com/google/brax/issues/433
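
In the meantime, if you want to check how much of the difference is seed variance rather than num_evals itself, one option (a minimal sketch along the lines of the tutorial's ppo.train call; the hyperparameters below are placeholders, not a recommended config) is to repeat each num_evals setting over a few seeds and collect the evaluation rewards via progress_fn:

```python
import functools
from brax.training.agents.ppo import train as ppo

def run(env, num_evals, seed):
  """Trains once and returns the (num_steps, eval reward) history."""
  rewards = []

  def progress(num_steps, metrics):
    # Same metric key the MJX tutorial plots.
    rewards.append((num_steps, metrics['eval/episode_reward']))

  # Placeholder hyperparameters; substitute your own tutorial config.
  train_fn = functools.partial(
      ppo.train,
      num_timesteps=10_000_000,
      num_evals=num_evals,
      episode_length=1000,
      num_envs=2048,
      batch_size=1024,
      seed=seed,
  )
  make_policy, params, _ = train_fn(environment=env, progress_fn=progress)
  return rewards

# Example: compare each num_evals setting across a few seeds to
# separate seed-to-seed variance from a real effect of num_evals.
# for num_evals in (2, 20):
#   for seed in (0, 1, 2):
#     history = run(env, num_evals, seed)
```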