Can you sanity check whether A2C improves from a random policy after 1e8 time steps? If the policy is improving, we need to tune the training process; if not, the A2C implementation itself might have a problem.
You can try a few techniques, such as curriculum learning (start with short episodes at the beginning of training and gradually increase the episode length so the learning task becomes more and more challenging, similar to Section 5.3, Challenge 1 of this paper: http://people.csail.mit.edu/hongzi/content/publications/Decima-Sigcomm19.pdf). You can also implement an input-dependent baseline (https://openreview.net/forum?id=Hyg1G2AqtQ). For this problem, you can use a time-dependent baseline for simplicity: make the parallel agents experience the same input sequence (i.e., job sequence) and compute the baseline return at each time step by averaging over that time step across the parallel agents.
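A minimal NumPy sketch of that time-dependent baseline (the function name and array shapes are assumptions, not actual training code): run several parallel rollouts on the same job sequence, then subtract the per-time-step mean return.

```python
import numpy as np

def time_dependent_advantages(returns):
    """returns: array of shape (num_parallel_agents, episode_len).
    Every parallel agent must have seen the *same* job arrival sequence,
    so the mean return at time t is a valid baseline for time t."""
    baseline = returns.mean(axis=0)              # shape (episode_len,)
    advantages = returns - baseline[np.newaxis]  # broadcast over agents
    return advantages
```

These advantages would then replace the usual `return - value_estimate` term in the policy-gradient loss.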
These may also be helpful: #13 #11 #16
And it's always a good idea to try something simpler first: just two servers to choose from, with a fixed job arrival pattern. See if the baseline A2C can at least learn a greedy policy.
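One way to fix the job arrival pattern, assuming the environment follows the gym interface and its `seed()` call controls the arrival process (a sketch, not code from the repo):

```python
import gym

class FixedArrivalWrapper(gym.Wrapper):
    """Reseed the environment before every reset so each episode
    replays the same job arrival sequence."""
    def __init__(self, env, seed=42):
        super().__init__(env)
        self._seed = seed

    def reset(self, **kwargs):
        self.env.seed(self._seed)   # same seed => same arrivals each episode
        return self.env.reset(**kwargs)
```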
Thanks for your response!
Actually, I tried using behaviour cloning to make the A2C agent mimic the greedy policy first, and then re-trained the agent on on-policy data. After behaviour cloning, the reward is almost the same as the greedy baseline. However, once training switches to the on-policy data, the reward drops as the number of time steps increases.
It seems the randomness of the job arrivals is an issue here. I will read your variance-reduction paper and try to find a solution.
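For reference, a rough sketch of how the behaviour-cloning step described above could be wired up in stable-baselines (here `env` is the load-balance environment and `greedy_action` is a hypothetical `obs -> action` function implementing the greedy rule; the actual setup may differ):

```python
from stable_baselines import A2C
from stable_baselines.gail import ExpertDataset, generate_expert_traj

# Record trajectories from the greedy policy into greedy_expert.npz.
generate_expert_traj(greedy_action, 'greedy_expert', env=env, n_episodes=100)

# Behaviour-clone the A2C policy on the recorded greedy trajectories.
model = A2C('MlpPolicy', env, max_grad_norm=10, verbose=1)
dataset = ExpertDataset(expert_path='greedy_expert.npz', batch_size=64)
model.pretrain(dataset, n_epochs=1000)

# Then continue with regular on-policy A2C updates.
model.learn(total_timesteps=int(1e8))
```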
Hi, Hongzi.
I am trying to reproduce the results of the load balancer using A2C from stable-baselines. I followed the experimental settings described in the ICML paper and also clipped the gradients by setting max_grad_norm to 10. However, the trained A2C agent cannot beat the baseline greedy policy even after 1e8 time steps.
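For concreteness, a minimal sketch of the kind of setup described above (assuming a gym-compatible `load_balance` environment from park; the vectorized wrapper and any remaining hyperparameters are assumptions):

```python
import park                                   # https://github.com/park-project/park
from stable_baselines import A2C
from stable_baselines.common.vec_env import DummyVecEnv

# park's load_balance env follows a gym-like interface; a thin wrapper
# may still be needed to make it fully gym-compatible.
env = DummyVecEnv([lambda: park.make('load_balance')])

model = A2C('MlpPolicy', env,
            max_grad_norm=10,   # gradient clipping as mentioned above
            verbose=1)
model.learn(total_timesteps=int(1e8))
```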
The greedy policy achieves an episode reward of -7e5, while the A2C agent only achieves -2.74e6 in its best test. It seems the greedy policy is a better strategy than DRL here.
Maybe the default settings of the other hyperparameters are not suitable. I am wondering if there are other tricks for training the load balancer to achieve satisfactory performance. Looking forward to your reply. Thanks!