Closed: lehduong closed this issue 4 years ago
Thanks for your interest! I think the most important thing to notice is that, with its default settings, this environment has a very long time horizon (#11). The agent (or a shim between the environment and the agent) needs to terminate episodes early if you want to implement curriculum learning.
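For concreteness, a minimal sketch of such a shim is below. It assumes park's gym-style `reset()`/`step()` interface; the wrapper itself (including the `max_episode_steps` argument) is just an illustration, not code from this repo:

```python
import park


class EarlyTerminationShim:
    """Illustrative wrapper that cuts episodes short for curriculum training.

    Assumes the gym-style reset()/step() interface that park envs expose;
    this class is a sketch, not part of park.
    """

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self._steps = 0

    def reset(self):
        self._steps = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._steps += 1
        if self._steps >= self.max_episode_steps:
            done = True  # early-terminate so episodes stay short early in training
        return obs, reward, done, info


env = EarlyTerminationShim(park.make('load_balance'), max_episode_steps=500)
```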
For your questions,
We did normalize observations and rewards so that they are roughly on the scale of 1. It's a manual normalization; you can collect some experience data and inspect its range to pick the scaling.
The environment clips the observation: https://github.com/park-project/park/blob/master/park/envs/load_balance/load_balance.py#L111-L115. I don't recall clipping the reward, though.
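As a rough illustration of what the manual normalization can look like (the scale constants below are placeholders I made up; derive the real ones from the ranges you observe in collected experience, and the extra observation clipping just mirrors the idea of the linked code rather than reproducing it):

```python
import numpy as np

# Placeholder scale constants -- collect some experience first and set these
# from the observed ranges so normalized values are roughly O(1).
OBS_SCALE = 1e5      # e.g., on the order of observed job sizes / queue lengths
REWARD_SCALE = 1e4   # e.g., on the order of observed per-step penalties


def normalize_obs(obs):
    """Scale observations to roughly unit magnitude, with safety clipping."""
    obs = np.asarray(obs, dtype=np.float32) / OBS_SCALE
    return np.clip(obs, 0.0, 10.0)


def normalize_reward(reward):
    """Scale rewards to roughly unit magnitude (no clipping, per the thread)."""
    return reward / REWARD_SCALE
```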
We set it to some small value (hundreds of steps) and linearly increased it over time (up to hundreds of thousands of steps, I think).
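A minimal sketch of such a linear schedule (the start, end, and ramp length below are placeholders, not the exact values we used):

```python
def max_episode_steps_schedule(train_step,
                               start_steps=500,          # "hundreds of steps"
                               end_steps=200_000,        # "hundreds of thousands of steps"
                               ramp_train_steps=1_000_000):
    """Linearly grow the episode-length cap over the course of training."""
    frac = min(train_step / ramp_train_steps, 1.0)
    return int(start_steps + frac * (end_steps - start_steps))
```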
We used 1e-3, but that might differ from framework to framework (e.g., it depends on how you compute the policy gradient or loss, and whether it's averaged or summed over a batch). I would sweep a few learning rates (e.g., 1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2, etc.) and see which one gives you the fastest convergence with similar final performance.
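If it helps, here is a skeleton of the kind of sweep I mean; `train_agent` is a placeholder for whatever A2C/PPO training entry point your framework provides (it is not a park API), and the random return value only keeps the skeleton runnable:

```python
import random


def train_agent(env_name, learning_rate):
    """Placeholder: run your A2C/PPO training here and return a score,
    e.g., the final mean episode return. The random value is just a stub."""
    return random.random()


learning_rates = [1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2]
scores = {lr: train_agent('load_balance', lr) for lr in learning_rates}
best_lr = max(scores, key=scores.get)
print('best learning rate:', best_lr)
```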
Hope these help!
Thanks a lot for the helpful clarification.
Hi, I'm trying to reproduce the baseline results of A2C and PPO on the load_balance environment. However, I find it very difficult to get the agents to converge (in my experiments, the agents barely improve after millions of environment steps). Could you kindly clarify the implementation details for training these two approaches?
- Did you normalize the observations/rewards?
- Did you clip the observations/rewards?
- How did you schedule `max_episode_steps`, i.e., did you increase it over time?
- What learning rate did you use?
Thank you very much.
Hi, I also ran into the same problem of the A2C agent not improving (I tried both the original park code and your fork with your changes). May I ask whether you figured out this issue? Thank you in advance!
The original A2C agent is expected to struggle with this environment because of the variance caused by the random job sequence. We wrote a paper describing this issue: https://openreview.net/forum?id=Hyg1G2AqtQ. You might want to try the agent from this codebase: https://github.com/hongzimao/input_driven_rl_example.
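For intuition: the paper's fix is an input-dependent baseline, i.e., a baseline conditioned on the exogenous input trace (here, the random job-arrival sequence); one simple way to estimate it is to average returns over several rollouts that replay the same job sequence. A rough sketch of that averaging idea (not the code from either repo; `run_rollout_return` and the seeding mechanism are hypothetical stand-ins):

```python
import numpy as np


def input_dependent_baseline(run_rollout_return, input_seed, num_rollouts=8):
    """Average returns over rollouts that share one fixed job-arrival sequence.

    `run_rollout_return(input_seed)` is a hypothetical callable that runs one
    episode with the job sequence fixed by `input_seed` and returns its return.
    The average serves as a baseline specific to that input sequence.
    """
    returns = [run_rollout_return(input_seed) for _ in range(num_rollouts)]
    return float(np.mean(returns))
```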