hongzimao / input_driven_rl_example

Variance Reduction for Reinforcement Learning in Input-Driven Environments (ICLR '19)
https://people.csail.mit.edu/hongzi/var-website/index.html
MIT License

Implementation on ABR algorithm #4

Closed: allen4747 closed this 4 years ago

allen4747 commented 4 years ago

Thank you, Hongzi, for such great work!

I have tried to use the multi-critic algorithm based on the Pensieve structure. However, you mentioned that pre-generated input sequences are required as the training input. How should I process my input in the ABR environment? Do I need to sample a fixed set of N traces and use N critics to train the RL model? In that case, I think the training data would be constrained to these N traces.

Can you explain how to implement the algorithm in the ABR environment?

Thanks!

hongzimao commented 4 years ago

Yes, your understanding is correct - we need to sample a fixed set of N traces and create N corresponding critics for them. The implementation would be similar to https://github.com/hongzimao/input_driven_rl_example/blob/master/load_balance_actor_multi_critic_train.py#L219-L225, where we pick a trace (controlled by a random seed) for the environment and map it to a fixed critic (indexed by m). The hope for this approach is that the agent can generalize its policy when N gets large (a more scalable approach would be meta-learning; we provide an implementation in #2 if you are interested).
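For concreteness, here is a minimal, hedged sketch of the trace-to-critic mapping. Everything in it is a toy stand-in: the traces are random arrays, each "critic" is just a scalar running average of that trace's return rather than a real value network, and none of the names come from this repo.

```python
import numpy as np

rng = np.random.default_rng(0)

# N pre-generated input traces (random arrays as stand-ins for real network traces)
N = 4
trace_pool = [rng.random(50) for _ in range(N)]

# one critic per trace; here just a scalar running average of that trace's return
critic_value = np.zeros(N)
critic_count = np.zeros(N)

for iteration in range(100):
    m = rng.integers(N)                  # pick a trace index (like seeding the env)
    trace = trace_pool[m]                # the environment replays exactly this input
    total_reward = float(trace.sum())    # stand-in for the return of one policy rollout

    baseline = critic_value[m]           # only critic m is queried for trace m
    advantage = total_reward - baseline  # would feed the policy-gradient update

    critic_count[m] += 1                 # and only critic m is updated afterwards
    critic_value[m] += (total_reward - critic_value[m]) / critic_count[m]
```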

Along this line of thought, a different approach (perhaps easier to implement) is to (1) sample a trace, (2) sample rollouts with this fixed trace multiple times (you can do this with parallel actors to speed up), (3) compute the baseline using the time-average total reward from these rollouts, and (4) do the policy-gradient update and repeat (1) with a different trace. This approach would be "critic-network-free" but still captures the essence of the "input-driven baseline". We used this simple approach in this work: https://openreview.net/forum?id=SJlCkwN8iV.
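A hedged sketch of that loop is below; the `sample_trace` and `run_rollout` helpers are hypothetical placeholders for your ABR environment and policy, not part of this repo.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trace(length=100):
    # placeholder: in ABR this would load one real bandwidth trace
    return rng.random(length)

def run_rollout(trace, horizon=50):
    # placeholder: run the current policy on this fixed trace and
    # return the per-step rewards of one trajectory
    return trace[:horizon] + rng.normal(scale=0.1, size=horizon)

num_rollouts = 8  # these rollouts can run in parallel actors to speed things up

for iteration in range(10):
    trace = sample_trace()                                  # (1) sample a trace
    rewards = np.stack([run_rollout(trace)                  # (2) several rollouts with
                        for _ in range(num_rollouts)])      #     the same fixed trace
    returns = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]  # R_t for every trajectory
    baseline = returns.mean(axis=0)                         # (3) input-driven baseline
    advantages = returns - baseline                         # (4) policy-gradient step,
                                                            #     then repeat with a new trace
```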

allen4747 commented 4 years ago

@hongzimao
Thanks for your quick reply.

I am still a little bit confused about how to get the baseline. Now, my input is a trace in batch format. The baseline is calculated by sampling the rewards from the batch and averaging them. Am I right?

hongzimao commented 4 years ago

To be more specific - it's a "time-"average total reward from the batch (a separate average at each step t). I don't quite get what you meant by "trace in batch format", though.

What we did was: at each training iteration, we fix an input trace and run a batch of MDP trajectories (the agent interacts with the environment under the same external input sequence). Then, within this batch, we compute the baseline value b_t at step t as the average total reward R_t (R_t = r_t + r_{t+1} + ...) over all trajectories (i.e., the average of R_t from trajectory 1, R_t from trajectory 2, ...).
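In a toy numpy form (two trajectories of three steps, both run on the same trace), this baseline computation looks like the following sketch:

```python
import numpy as np

# rewards[i, t] = reward r_t in trajectory i; both trajectories share the same input trace
rewards = np.array([[1.0, 0.5, 0.2],
                    [0.8, 0.7, 0.1]])

# R_t = r_t + r_{t+1} + ...  (reverse cumulative sum along the time axis)
returns = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]
# returns -> [[1.7, 0.7, 0.2],
#             [1.6, 0.8, 0.1]]

# baseline b_t = average of R_t over the trajectories in the batch
baseline = returns.mean(axis=0)   # -> [1.65, 0.75, 0.15]

# advantage used in the policy-gradient update
advantages = returns - baseline
```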

allen4747 commented 4 years ago

Understood! I changed my setting, and the experimental results showed the effectiveness of this approach. Thanks! Respect!