google / dopamine

Dopamine is a research framework for fast prototyping of reinforcement learning algorithms.
https://github.com/google/dopamine
Apache License 2.0

Extendability For Policy Gradients? #45

Closed slerman12 closed 5 years ago

slerman12 commented 6 years ago

Is this extensible to policy gradient or actor-critic architectures, or would one have to do major reworking? I'm trying to decide whether to use this framework for a project or implement from scratch. I will be using A2C. Any advice would be appreciated!

xffxff commented 6 years ago

I imitated Dopamine to implement A2C and PPO in PyTorch; maybe it can help you: https://github.com/XFFXFF/endorphin

zafarali commented 6 years ago

Let's start by thinking about the simplest policy gradient algorithm: what do we need?

  1. A way to collect trajectories from the environment.
  2. A way to calculate the return.
  3. The policy gradient loss: log pi(a|s) * G_t

Dopamine has some of these things already implemented!

  1. We can use run_experiment.Runner._run_one_episode to collect trajectories.
  2. I could not find a built-in way to calculate the return, but this is easy to implement.
  3. The loss is easy to implement with just a few TensorFlow ops within the Agent class, or you could use https://github.com/deepmind/trfl, which has all these losses implemented. (See the sketch after this list for items 2 and 3.)
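
If it helps, here is a rough sketch of items 2 and 3. This is not Dopamine code: it uses TF 1.x-style ops with made-up observation/action sizes and a toy linear policy, just to show the shape of the computation.

```python
import numpy as np
import tensorflow as tf

def discounted_returns(rewards, gamma=0.99):
  """Computes G_t = r_t + gamma * r_{t+1} + ... for a single episode."""
  returns = np.zeros(len(rewards), dtype=np.float32)
  running = 0.0
  for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running
  return returns

# Policy gradient loss: -mean(log pi(a_t|s_t) * G_t).
num_actions, obs_dim = 4, 128                  # hypothetical sizes
observations = tf.placeholder(tf.float32, [None, obs_dim])
actions = tf.placeholder(tf.int32, [None])     # actions that were taken
returns = tf.placeholder(tf.float32, [None])   # G_t from discounted_returns

logits = tf.layers.dense(observations, num_actions)   # toy linear policy
log_pi = -tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=actions, logits=logits)                    # log pi(a_t|s_t)
pg_loss = -tf.reduce_mean(log_pi * returns)
train_op = tf.train.AdamOptimizer(1e-3).minimize(pg_loss)
```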

How would we modify this to do Actor-Critic?

  1. We'd modify the _run_one_episode method into something like a _rollout_num_steps method that only takes a short rollout of a fixed number of steps.
  2. Train a value function to bootstrap the return; this is also doable by reusing the data from the rollout.
  3. Change the loss to train both the policy and the value function (sketched below).
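
A sketch of what the combined actor-critic loss could look like, again in TF 1.x style rather than Dopamine's actual API. The value head bootstraps the n-step return, and the network sizes are made up.

```python
import numpy as np
import tensorflow as tf

def n_step_returns(rewards, bootstrap_value, gamma=0.99):
  """Backs up R_t = r_t + gamma * R_{t+1}, seeded with V(s_{t+n})."""
  returns = np.zeros(len(rewards), dtype=np.float32)
  running = bootstrap_value
  for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running
  return returns

obs_dim, num_actions = 4, 2                              # hypothetical sizes
observations = tf.placeholder(tf.float32, [None, obs_dim])
actions = tf.placeholder(tf.int32, [None])
returns = tf.placeholder(tf.float32, [None])             # n-step returns

hidden = tf.layers.dense(observations, 64, tf.nn.relu)
logits = tf.layers.dense(hidden, num_actions)            # policy head
value = tf.squeeze(tf.layers.dense(hidden, 1), axis=-1)  # value head

advantage = returns - value
log_pi = -tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=actions, logits=logits)
policy_loss = -tf.reduce_mean(log_pi * tf.stop_gradient(advantage))
value_loss = tf.reduce_mean(tf.square(advantage))
loss = policy_loss + 0.5 * value_loss
train_op = tf.train.AdamOptimizer(7e-4).minimize(loss)
```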

Now the hard part is scaling this to batch policy gradient methods so that you can collect data in parallel. With a bit of careful work this is definitely possible. The naive way would be to call _run_one_episode many times (sequentially) to collect a batch, keeping track of when episodes ended so we can mask them when computing the returns and losses. The problem with this solution is that we can't make use of batch computation when executing the actions.

For the best performance, you'd have to modify the _run_one_episode method so that we can execute a batch of actions in multiple environments and collect their experience together. In practice, stepping each environment serially while batching action selection on the GPU can lead to a reasonable runtime for training on Atari (I've done this before, and it takes about a day to get reasonable results on Pong). For best results we probably want to step in parallel.
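
Concretely, the "serial stepping + batched action selection" idea looks roughly like this (plain Gym with its classic API, and a random policy standing in for the network; none of these names are Dopamine methods):

```python
import numpy as np
import gym

# Hypothetical stand-in for the agent's batched action selection; in
# practice this would be one forward pass of the policy network on the GPU.
def select_actions(batch_obs, envs):
  return np.array([env.action_space.sample() for env in envs])

envs = [gym.make('CartPole-v0') for _ in range(8)]
obs = np.stack([env.reset() for env in envs])

for step in range(5):
  actions = select_actions(obs, envs)           # one batched policy query
  next_obs, rewards, dones = [], [], []
  for env, a in zip(envs, actions):             # but serial env stepping
    o, r, done, _ = env.step(a)
    if done:        # track episode ends so returns/losses can be masked
      o = env.reset()
    next_obs.append(o)
    rewards.append(r)
    dones.append(done)
  obs = np.stack(next_obs)
```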

slerman12 commented 6 years ago

Thank you so much for this comprehensive answer. I'm still learning how to implement policy gradient methods and am not completely familiar with the terminology. Can you tell me what the difference between an episode and a rollout is?

As for collecting data in parallel, my experience with parallelism in Python has been pretty rough because of issues with the GIL. Do you know of an (easy to read) example of how to run environments in parallel and collect data in Python?

On the conceptual side, my understanding is that the agent would have to be copied to each GPU, or is this wrong? And then at the end of a "rollout" (if I'm using the term correctly) all experiences would be batched together to update the agent globally. Is this correct?

I'm afraid using Dopamine might complicate learning how to do these things for me since I'm still inexperienced, but I like that more sources are being made available for reproducible RL.

zafarali commented 6 years ago

An episode is one run of the policy in the environment, from a start state to a terminal state. A rollout is just a number of steps taken in the environment, so you can think of an episode as a complete rollout. In A2C the rollouts are usually limited to a fixed number of steps before the gradient update is applied; the rollout then continues from where it stopped.
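
To make that concrete, here is a toy loop (plain Gym, random policy, nothing Dopamine-specific) where each rollout picks up wherever the previous one stopped and only a terminal state resets the environment:

```python
import gym

env = gym.make('CartPole-v0')
obs = env.reset()
n_steps = 5                                    # rollout length

for update in range(3):
  rollout = []
  for _ in range(n_steps):                     # a rollout: n steps, not a full episode
    action = env.action_space.sample()         # stand-in for the policy
    next_obs, reward, done, _ = env.step(action)
    rollout.append((obs, action, reward, done))
    obs = env.reset() if done else next_obs    # episode boundary, not rollout boundary
  # ... compute returns from `rollout` and apply the gradient update here ...
```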

Here is an example of collecting data in parallel; I do not recommend implementing it yourself except as an exercise. OpenAI Baselines: https://github.com/openai/baselines/blob/master/baselines/common/vec_env/subproc_vec_env.py
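
Usage looks roughly like this (from memory; check the linked file for the exact interface):

```python
import gym
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv

def make_env():
  return gym.make('CartPole-v0')

# Each entry of the list is a function that builds a fresh env in its own
# subprocess, which sidesteps the GIL for environment stepping.
venv = SubprocVecEnv([make_env for _ in range(8)])
obs = venv.reset()                               # batched: shape [8, obs_dim]
actions = [venv.action_space.sample() for _ in range(8)]
obs, rewards, dones, infos = venv.step(actions)  # all batched across the 8 envs
venv.close()
```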

On the conceptual side, my understanding is that the agent would have to be copied to each GPU, or is this wrong? And then at the end of a "rollout" (if I'm using the term correctly) all experiences would be batched together to update the agent globally.

In A2C there is no concept of multiple agents on multiple GPUs. There's a central agent (maybe on the GPU) that takes steps in many environments at the same time, and the experience is batched at the end of the rollout to do an update to the model.

For a simple version of the REINFORCE algorithm you can look here: https://github.com/google-research/policy-learning-landscape/blob/6e32bc480eec6ee2804738ea0340dc2d1091d0d3/eager_pg/algorithms/reinforce.py#L42-L75. To change this into A2C, you will need to change collect_trajectories to do some kind of n-step rollout and handle updating the value function accordingly.

I think Dopamine is not very complicated, but given that it is built for value-based RL and that you are still learning, implementing A2C here might get confusing.