slerman12 closed this issue 5 years ago.
I imitated Dopamine to implement A2C and PPO using PyTorch. Maybe it can help you: https://github.com/XFFXFF/endorphin
Let's start by thinking about the simplest policy gradient algorithm: what do we need?
Dopamine has some of these things already implemented! You can use `run_experiment.Runner._run_one_episode` to collect trajectories. How would we modify this to do Actor-Critic? We would change the `_run_one_episode` method into something like a `_rollout_num_steps` method that does a short rollout.

Now the hard part is scaling this to batch policy gradient methods so that you can collect data in parallel. With a little bit of careful work this is definitely possible.
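To make the short-rollout idea concrete, here is a minimal sketch assuming a plain gym environment with the classic `(obs, reward, done, info)` step API; the function name mirrors the `_rollout_num_steps` suggestion above, but the body is my own, not Dopamine's Runner code:

```python
import gym

def rollout_num_steps(env, policy, obs, num_steps=5):
    """Collect a fixed-length rollout instead of a full episode.

    `policy(obs)` is assumed to return an action; `obs` is the observation
    to resume from, so consecutive rollouts continue where the last one stopped.
    """
    observations, actions, rewards, dones = [], [], [], []
    for _ in range(num_steps):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        observations.append(obs)
        actions.append(action)
        rewards.append(reward)
        dones.append(done)
        # Reset when an episode ends so the rollout can keep going.
        obs = env.reset() if done else next_obs
    return observations, actions, rewards, dones, obs

# Toy usage with a random "policy".
env = gym.make('CartPole-v0')
obs = env.reset()
trajectory = rollout_num_steps(env, lambda o: env.action_space.sample(), obs)
```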
The naive way to do this would be to just run `_run_one_episode` many times (sequentially) to collect a batch. We need to keep track of when episodes ended so we can mask them when we compute the returns and losses. The problem with this solution is that we can't make use of batch computation when executing the actions.
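As an illustration of that masking, here is a minimal sketch of computing discounted returns over rewards collected from several back-to-back episodes; the done flags stop the return from leaking across episode boundaries (the variable names are mine):

```python
import numpy as np

def discounted_returns(rewards, dones, gamma=0.99):
    """Discounted returns that reset at episode boundaries.

    `rewards` and `dones` are 1-D arrays covering several episodes collected
    back to back; `dones[t]` is 1 (or True) at terminal steps.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Mask the recursion: a terminal step does not inherit future reward.
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns

print(discounted_returns(np.array([1., 1., 1., 1.]),
                         np.array([0., 1., 0., 0.])))
# -> roughly [1.99, 1.0, 1.99, 1.0]: the return resets after the terminal step.
```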
For the best performance, what you'd have to do is modify the `_run_one_episode` method so that we can execute a batch of actions in multiple environments and collect their experience together. In practice, stepping in each environment serially + a GPU can lead to a reasonable runtime for training on Atari (I've done this before and it takes about a day to get reasonable results in Pong). For the best results we probably want to step in parallel.
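A rough sketch of that batched-stepping idea, again using plain gym environments rather than Dopamine's Runner; `policy_batch` is a hypothetical stand-in for one batched forward pass through the policy network, which is the part the GPU actually accelerates:

```python
import gym
import numpy as np

num_envs, num_steps = 4, 5
envs = [gym.make('CartPole-v0') for _ in range(num_envs)]
obs = np.stack([env.reset() for env in envs])

def policy_batch(obs_batch):
    # Stand-in for one batched forward pass of the policy network.
    return np.array([env.action_space.sample() for env in envs])

for _ in range(num_steps):
    actions = policy_batch(obs)              # one forward pass for all envs
    next_obs, rewards, dones = [], [], []
    for env, action in zip(envs, actions):   # environments still step serially
        o, r, d, _ = env.step(action)
        if d:
            o = env.reset()
        next_obs.append(o)
        rewards.append(r)
        dones.append(d)
    obs = np.stack(next_obs)
    # Append obs/actions/rewards/dones for this step to the rollout buffer here.
```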
Thank you so much for this comprehensive answer. I'm still learning how to implement policy gradient methods and am not completely familiar with the terminology. Can you tell me what the difference between an episode and a rollout is?

As for collecting data in parallel, my experience with parallelism in Python has been pretty rough because of issues with the GIL. Do you know of an (easy to read) example of how to run environments in parallel and collect data in Python?

On the conceptual side, my understanding is that the agent would have to be copied to each GPU, or is this wrong? And then at the end of a "rollout" (if I'm using the term correctly) all the experiences would be batched together to update the agent globally. Is this correct?

I'm afraid using Dopamine might make it harder for me to learn these things since I'm still inexperienced, but I like that more sources are being made available for reproducible RL.
An episode is one run of the policy in the environment, that is, from a start state to a terminal state. A rollout is just some number of steps taken in the environment, so you can think of an episode as a complete rollout. In A2C the rollouts are usually limited to `x` steps before the gradient update is applied; the rollout then continues from where it stopped.
Here are some examples of collecting data in parallel. I do not recommend implementing it yourself except as an exercise. OpenAI baselines: https://github.com/openai/baselines/blob/master/baselines/common/vec_env/subproc_vec_env.py
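If I remember the baselines API correctly, usage looks roughly like this; treat the exact import path and return values as assumptions to verify against the linked file:

```python
import gym
import numpy as np
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv

num_envs = 8
# Each worker process builds its own environment from a thunk; stepping
# happens in separate processes, which is how the GIL is sidestepped.
env_fns = [lambda: gym.make('PongNoFrameskip-v4') for _ in range(num_envs)]
vec_env = SubprocVecEnv(env_fns)

obs = vec_env.reset()                               # shape: (num_envs, *obs_shape)
actions = np.array([vec_env.action_space.sample() for _ in range(num_envs)])
obs, rewards, dones, infos = vec_env.step(actions)  # all batched across envs
vec_env.close()
```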
> On the conceptual side, my understanding is that the agent would have to be copied to each GPU, or is this wrong? And then at the end of a "rollout" (if I'm using the term correctly) all the experiences would be batched together to update the agent globally.
In A2C there is no concept of multiple agents on multiple GPUs. There is a central agent (maybe on the GPU) that takes steps in many environments at the same time, and then the experience is batched at the end of the rollout to do an update to the model.
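Sketched with PyTorch (this is my own outline of the usual A2C loss, not code from Dopamine or baselines; a `model(obs)` that returns action logits and state values for the whole batch is an assumption):

```python
import torch
import torch.nn.functional as F

def a2c_update(model, optimizer, obs, actions, returns,
               value_coef=0.5, entropy_coef=0.01):
    """One gradient update on a batched rollout (actor-critic sketch)."""
    logits, values = model(obs)                      # one forward pass on the batch
    values = values.squeeze(-1)
    dist = torch.distributions.Categorical(logits=logits)

    advantages = returns - values.detach()           # advantage estimate
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = F.mse_loss(values, returns)         # critic regression target
    entropy = dist.entropy().mean()                  # encourages exploration

    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```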
For a simple version of the REINFORCE algorithm you can look here: https://github.com/google-research/policy-learning-landscape/blob/6e32bc480eec6ee2804738ea0340dc2d1091d0d3/eager_pg/algorithms/reinforce.py#L42-L75

To change this into A2C you will need to change `collect_trajectories` to do some kind of n-step rollout and handle updating the value function accordingly.
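The main change relative to REINFORCE's full-episode Monte Carlo returns is bootstrapping the end of the n-step rollout with the critic, roughly R_t = r_t + gamma * r_{t+1} + ... + gamma^(T-t) * V(s_T) for a rollout cut off at step T. A hedged sketch (the names are mine, not from reinforce.py):

```python
import numpy as np

def n_step_returns(rewards, dones, last_value, gamma=0.99):
    """Bootstrapped returns for an n-step actor-critic rollout.

    Same recursion as plain Monte Carlo returns, except that it is seeded
    with the critic's estimate V(s_T) (`last_value`), so the rollout can be
    cut off after n steps without waiting for the episode to finish.
    """
    returns = np.zeros(len(rewards))
    running = last_value  # bootstrap from the value function
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns
```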
I think Dopamine is not very complicated, but given that it is built for value-based RL and that you are still learning, it might get confusing to implement A2C here.
Is this extensible to policy gradient or actor-critic architectures, or would one have to do major reworking? I'm trying to decide whether to use this framework for a project or implement from scratch. I will be using A2C. Any advice would be appreciated!