Closed: vassil-atn closed this issue 9 months ago
Hi @Vassil17
Thanks for bringing this to our attention. Indeed it's a curious finding. My initial hunch is that there is an interaction between num_evals and num_resets_per_eval that more or less changes the random seed of your run. In practice we find that training on the quadruped env has high variance across runs. This is something we need to debug further to make training more stable, and we'll keep an eye out for an underlying bug in the training code. Let's follow up here: https://github.com/google/brax/issues/433
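The seed-interaction hunch can be illustrated with a toy sketch (hypothetical, stdlib-only; this is NOT Brax's actual RNG handling): if per-epoch randomness is derived from a master seed in a way that depends on how many epochs the run is split into, then any parameter that changes the epoch count (such as num_evals) effectively reseeds the whole run.

```python
import hashlib

def epoch_seeds(master_seed: int, num_epochs: int) -> list[int]:
    """Derive one seed per epoch from a master seed.

    Toy stand-in for key-splitting (e.g. jax.random.split); deliberately
    folds num_epochs into the derivation to illustrate the hunch above.
    """
    return [
        int.from_bytes(
            hashlib.sha256(f"{master_seed}:{num_epochs}:{i}".encode()).digest()[:4],
            "big",
        )
        for i in range(num_epochs)
    ]

# Same master seed, different epoch counts -> different seed streams,
# so the two runs see different randomness from the very first epoch.
seeds_2 = epoch_seeds(0, 2)
seeds_20 = epoch_seeds(0, 20)
print(seeds_2[0] == seeds_20[0])
```

Under this (assumed) derivation scheme, two runs that differ only in num_evals are effectively two different random seeds, which matches the high run-to-run variance observed on the quadruped env.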
Hi,
I'm trying to get more familiar with MJX for quadruped RL training, so I've been going through the published tutorial.ipynb. I've noticed that the plotted results depend on the num_evals parameter passed to the PPO algorithm, which I assume controls how many times the policy is evaluated over the whole training run, i.e. num_evals=10 for 10k steps of training means a policy evaluation every 1k steps. What I don't understand is why modifying this parameter completely changes the policy. For example:
These are the results for the quadruped environment with num_evals=2:
And these are the results for the same environment with num_evals=20:
From what I can see in the code, the number of steps per training epoch depends on num_evals, which changes the training behaviour. Is there a way to work around this? I'd like to evaluate the policy more often (rather than twice over the whole training cycle) so I can better see how the reward changes.
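The budget split described above can be sketched as plain arithmetic (hypothetical helper name; my understanding is that Brax's PPO trainer runs an initial evaluation and then divides the budget across roughly num_evals - 1 training epochs, additionally rounding each epoch up to a multiple of batch_size * unroll_length * num_minibatches * action_repeat, which the sketch omits):

```python
import math

def env_steps_per_epoch(num_timesteps: int, num_evals: int) -> int:
    """Approximate environment steps between evaluations.

    One evaluation happens up front, so the remaining budget is split
    across num_evals - 1 training epochs (minimum one epoch). This is an
    illustrative simplification, not Brax's exact computation.
    """
    epochs = max(num_evals - 1, 1)
    return math.ceil(num_timesteps / epochs)

# The example from the question: a 10k-step budget.
print(env_steps_per_epoch(10_000, 10))  # ~1.1k steps between evals
print(env_steps_per_epoch(10_000, 2))   # one single 10k-step epoch
```

This is why changing num_evals changes more than just the plot resolution: it changes the epoch length, and with it how often per-epoch machinery (resets, RNG key consumption) runs, so the two runs diverge.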