Closed: lehduong closed this issue 4 years ago
Thanks for your interest! I think the most important thing to notice is that, with its default settings, this environment has a very long time horizon (#11). The agent (or a shim between the environment and the agent) needs to terminate episodes early if you want to implement curriculum learning.
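For concreteness, a minimal sketch of such a shim is below. It assumes park's gym-style `reset()`/`step()` interface; the wrapper itself (including the `max_episode_steps` argument) is just an illustration, not code from this repo:

```python
import park


class EarlyTerminationShim:
    """Illustrative wrapper that cuts episodes short for curriculum training.

    Assumes the gym-style reset()/step() interface that park envs expose;
    this class is a sketch, not part of park.
    """

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self._steps = 0

    def reset(self):
        self._steps = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._steps += 1
        if self._steps >= self.max_episode_steps:
            done = True  # early-terminate so episodes stay short early in training
        return obs, reward, done, info


env = EarlyTerminationShim(park.make('load_balance'), max_episode_steps=500)
```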
For your questions,
We did normalize observations and rewards so that they are roughly on the scale of 1. It's a manual normalization; you can collect some experience data and inspect its range to pick the scaling.
The environment clips the observation: https://github.com/park-project/park/blob/master/park/envs/load_balance/load_balance.py#L111-L115. I don't recall clipping the reward, though.
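As a rough illustration of what the manual normalization can look like (the scale constants below are placeholders I made up; derive the real ones from the ranges you observe in collected experience, and the extra observation clipping just mirrors the idea of the linked code rather than reproducing it):

```python
import numpy as np

# Placeholder scale constants -- collect some experience first and set these
# from the observed ranges so normalized values are roughly O(1).
OBS_SCALE = 1e5      # e.g., on the order of observed job sizes / queue lengths
REWARD_SCALE = 1e4   # e.g., on the order of observed per-step penalties


def normalize_obs(obs):
    """Scale observations to roughly unit magnitude, with safety clipping."""
    obs = np.asarray(obs, dtype=np.float32) / OBS_SCALE
    return np.clip(obs, 0.0, 10.0)


def normalize_reward(reward):
    """Scale rewards to roughly unit magnitude (no clipping, per the thread)."""
    return reward / REWARD_SCALE
```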
We set it to some small value (hundreds of steps) and linearly increased it over time (up to hundreds of thousands of steps, I think).
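A minimal sketch of such a linear schedule (the start, end, and ramp length below are placeholders, not the exact values we used):

```python
def max_episode_steps_schedule(train_step,
                               start_steps=500,          # "hundreds of steps"
                               end_steps=200_000,        # "hundreds of thousands of steps"
                               ramp_train_steps=1_000_000):
    """Linearly grow the episode-length cap over the course of training."""
    frac = min(train_step / ramp_train_steps, 1.0)
    return int(start_steps + frac * (end_steps - start_steps))
```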
We used 1e-3, but that might differ from framework to framework (e.g., it depends on how you compute the policy gradient or loss, and whether it's averaged or summed over a batch). I would sweep a few learning rates (e.g., 1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2, etc.) and see which one gives you the fastest convergence with similar final performance.
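If it helps, here is a skeleton of the kind of sweep I mean; `train_agent` is a placeholder for whatever A2C/PPO training entry point your framework provides (it is not a park API), and the random return value only keeps the skeleton runnable:

```python
import random


def train_agent(env_name, learning_rate):
    """Placeholder: run your A2C/PPO training here and return a score,
    e.g., the final mean episode return. The random value is just a stub."""
    return random.random()


learning_rates = [1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2]
scores = {lr: train_agent('load_balance', lr) for lr in learning_rates}
best_lr = max(scores, key=scores.get)
print('best learning rate:', best_lr)
```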
Hope these help!
Thanks a lot for the helpful clarification.
Hi, I'm trying to reproduce the baseline results of A2C and PPO on the load_balance environment. However, I find it very difficult to get the agents to converge (in my experiments, the agents barely improve after millions of environment steps). Could you kindly clarify the implementation details for training these two approaches?
- Did you normalize the observations/rewards?
- Did you clip the observations/rewards?
- How did you schedule `max_episode_steps`, i.e., did you increase it over time?
- What learning rate did you use?
Thank you very much.
Hi, I also ran into the same problem of the A2C agent not improving (I tried both the original park code and your fork with your changes). May I ask whether you figured out this issue? Thank you in advance!
The original A2C agent is expected to struggle with this environment because of the variance caused by the random job sequence. We wrote a paper describing this issue: https://openreview.net/forum?id=Hyg1G2AqtQ. You might want to try the agent from this codebase: https://github.com/hongzimao/input_driven_rl_example.
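For intuition: the paper's fix is an input-dependent baseline, i.e., a baseline conditioned on the exogenous input trace (here, the random job-arrival sequence); one simple way to estimate it is to average returns over several rollouts that replay the same job sequence. A rough sketch of that averaging idea (not the code from either repo; `run_rollout_return` and the seeding mechanism are hypothetical stand-ins):

```python
import numpy as np


def input_dependent_baseline(run_rollout_return, input_seed, num_rollouts=8):
    """Average returns over rollouts that share one fixed job-arrival sequence.

    `run_rollout_return(input_seed)` is a hypothetical callable that runs one
    episode with the job sequence fixed by `input_seed` and returns its return.
    The average serves as a baseline specific to that input sequence.
    """
    returns = [run_rollout_return(input_seed) for _ in range(num_rollouts)]
    return float(np.mean(returns))
```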