openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Baseline implementations seem non-modular #772

Open Ploppz opened 5 years ago

Ploppz commented 5 years ago

I would just like to share my experience after working with the ppo2 implementation of PPO. I happily welcome objections, in case I just haven't found the right way to work with the implementations.

The ppo2 interface is the learn() function, which seems very opaque to me. Why not offer a PPO agent that you can simply feed data to, and which will use it to learn?

Example 1: When training Acrobot-v1 with ppo2, I wanted to plot the total reward for each episode. You can find the code in https://github.com/openai/baselines/issues/771 - there you can see that I had to make a wrapper around the environment to capture the rewards experienced by the agent (or did I just miss an easier solution?).
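The wrapper was essentially of this shape (a simplified sketch, not the exact code from #771):

class RewardLoggingEnv:
    """Records the total reward of each finished episode (simplified sketch)."""

    def __init__(self, inner_env):
        self.env = inner_env
        self.observation_space = inner_env.observation_space
        self.action_space = inner_env.action_space
        self.episode_rewards = []
        self._episode_total = 0.0

    def step(self, action):
        state, reward, terminal, info = self.env.step(action)
        self._episode_total += reward
        if terminal:
            self.episode_rewards.append(self._episode_total)
            self._episode_total = 0.0
        return state, reward, terminal, info

    def reset(self):
        self._episode_total = 0.0
        return self.env.reset()

    def render(self):
        return self.env.render()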

Example 2: I wanted to try out RND (https://arxiv.org/abs/1810.12894) on the same environment. They use PPO in the paper, so I tried to use the ppo2 implementation. In this case too (and perhaps with more justification than in Example 1), I found that I had to make a wrapper around the environment in order to ignore the extrinsic reward (the paper doesn't do that, but I'm not interested in it) and calculate the intrinsic reward instead. Here is the env wrapper:

class RNDEnv:
    """Environment wrapper that replaces the extrinsic reward with the
    RND intrinsic reward computed by the agent."""

    def __init__(self, inner_env, agent):
        self.env = inner_env
        self.agent = agent
        self.observation_space = inner_env.observation_space
        self.action_space = inner_env.action_space

    def step(self, action):
        state_next, extrinsic, terminal, info = self.env.step(action)
        # Discard the extrinsic reward and return the intrinsic (novelty) reward instead.
        intrinsic = self.agent.visit(state_next)
        return state_next, intrinsic, terminal, info

    def reset(self):
        return self.env.reset()

    def render(self):
        return self.env.render()

Here, agent.visit(state) calculates the intrinsic reward for that state, i.e. the prediction error between a target network and a predictor network as defined in the RND paper.
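As a toy illustration of the idea behind visit (plain numpy, linear networks instead of the CNNs and observation normalization used in the actual paper - so a sketch, not my real implementation):

import numpy as np

class RNDBonus:
    """Toy sketch of the RND intrinsic reward: prediction error between a
    fixed random target network and a trained predictor network."""

    def __init__(self, obs_dim, feat_dim=32, lr=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.W_target = rng.normal(size=(obs_dim, feat_dim))  # fixed random target
        self.W_pred = np.zeros((obs_dim, feat_dim))           # trained predictor
        self.lr = lr

    def visit(self, state):
        s = np.asarray(state, dtype=np.float64)
        err = s @ self.W_pred - s @ self.W_target   # predictor vs. target features
        intrinsic = float(np.mean(err ** 2))        # prediction error = intrinsic reward
        # One SGD step on the predictor, so frequently visited states become
        # predictable and their bonus shrinks over time.
        self.W_pred -= self.lr * (2.0 / err.size) * np.outer(s, err)
        return intrinsic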

I also want to add epsilon-greedy exploration, but I have not yet found a way to do this.

I think this is a serious issue for the usefulness of this repository. In my opinion, it would be better to define agents that you feed data (next states, rewards) rather than monolithic functions that do everything for you. Such a change would be useful whenever one wants to build an agent that uses PPO as only one part of its internal workings, or wants to modify the reward (reward engineering, or research on intrinsic rewards) - or simply for having more control over what happens, which is especially useful in research.

pzhokhov commented 5 years ago

Hi @Ploppz ! While I agree with your general criticism that baselines should be more modular (in fact, we are working on making it so), I don't think the idea of providing agents that one feeds data to is easily extendable to cover all use cases. Specifically, different algorithms require different data from the environment / agent. Off-policy algorithms (DQN, DDPG) require only transition tuples (s_t, a_t, r_t, s_{t+1}). Policy-gradient-based algorithms require chunks of trajectories, along with the negative log likelihoods of the actions taken and the value function estimates of the states. The data may be coming from one or multiple agents interacting with their respective environments, and the algorithm has to be aware of that (for instance, if recurrent policies are used, the states of the RNN cells have to be managed).

In a sense, it seems to me that the environment itself is the right abstraction for what you call "just feed the data" - the agent supplies actions, the environment returns rewards and next observations. Another modular piece that can be modified / swapped (although not as easily as the environment) is the Runner (see the comment in issue #771) - so I would argue that that part is modular as well.

As for using PPO2 as a part of another algorithm - that depends quite a bit on the use case; however, what we can do (and it is rather straightforward) is split the model setup, rollout generation, training on a single batch of data, and the overall training loop into separate functions that can be used independently. Is that the kind of modularity that you had in mind?
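Concretely, the split might look roughly like this (illustrative names, loosely modeled on ppo2's internals rather than the current public API):

def learn_split(env, n_updates, n_opt_epochs, lr, cliprange):
    # 1) model setup: build the policy / value networks and the PPO loss
    model = build_ppo_model(env.observation_space, env.action_space)  # placeholder name
    # 2) rollout generation: a Runner steps the env with the current policy
    runner = Runner(env=env, model=model, nsteps=2048, gamma=0.99, lam=0.95)
    # 3) + 4) training loop alternating rollouts and single-batch updates
    for _ in range(n_updates):
        # (the real ppo2 runner also returns recurrent states and episode infos)
        obs, returns, masks, actions, values, neglogpacs = runner.run()
        for _ in range(n_opt_epochs):
            model.train(lr, cliprange, obs, returns, masks, actions, values, neglogpacs)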

All of that being said, the right abstractions for RL are by no means a solved problem; so if you have ideas on how to make things more modular (replaceable, testable - all those good SWE practices), feel free to post them or submit a PR directly.

pzhokhov commented 5 years ago

I am gonna keep this open for now, but at the moment I don't think there is anything actionable wrt this issue

Ploppz commented 5 years ago

I see now that it is indeed a rather hard problem to make this modular, and that my idea is too simple. Maybe the environment is indeed conceptually an okay abstraction, especially keeping in mind that the boundary between the "agent" and the "environment" is conceptually rather fuzzy (just think of how animals' reward systems sit inside their brains).

I'm sorry I can't come up with many useful ideas, especially because I haven't looked much into what is usually done in such an algorithm (e.g. batching, and, as you said, "chunks of trajectories, along with negative log likelihood of actions taken, and the value function estimates of the states").

About what you said policy-gradient algorithms require - "chunks of trajectories, along with negative log likelihood of actions taken, and the value function estimates of the states" - please forgive me if the following is a naive remark, but I don't really see why the collection of trajectories, the calculation of negative log likelihoods, and all that cannot be done inside a hypothetical PPOAgent class that only has an input stream of (s, a, s', r) (say, manifested as a member function step(s, a, s', r)). Doesn't it then get all the information it needs? What more information is there? Having multiple agents acting in parallel, though, is still a problem.
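To make it concrete, the interface I am imagining is something like this (purely hypothetical, names only for illustration):

class PPOAgent:
    """Hypothetical agent that is simply fed transitions and learns from them."""

    def act(self, state):
        """Return an action for the given state, sampled from the current policy."""

    def step(self, state, action, next_state, reward, terminal):
        """Record one transition. Internally the agent would buffer trajectories,
        compute whatever it needs (advantages, log likelihoods, ...) and run a
        PPO update once enough data has been collected."""

def run_episode(env, agent):
    """Usage sketch: drive the agent with a gym-style environment."""
    state = env.reset()
    terminal = False
    while not terminal:
        action = agent.act(state)
        next_state, reward, terminal, _ = env.step(action)
        agent.step(state, action, next_state, reward, terminal)
        state = next_state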

pzhokhov commented 5 years ago

Theoretically you are right - just storing old actions should be sufficient. Practically, however, the negative log probability and the action can be computed in a single pass through the neural net, whereas if you only store actions, you'll have to do that computation again. Also, for PPO you need the negative log likelihoods of the old actions at the time when the action was taken (i.e. with the policy parameters at the time of acting) - so if you don't store them outside of the PPO code, you'll have to either keep two copies of the parameters (the one corresponding to the action time and the current one), or do a separate pass computing the neglogp's of the actions before computing the gradients.
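To illustrate why the old neglogp's are stored at rollout time, here is roughly how they enter the clipped PPO objective (a simplified numpy sketch, variable names approximate):

import numpy as np

# Stored at rollout time (one forward pass gives action, value and neglogp):
#   neglogp_old[i] = -log pi_old(a_i | s_i), using the parameters at acting time.
# Recomputed at update time with the current parameters:
#   neglogp_new[i] = -log pi_new(a_i | s_i)

def ppo_surrogate(neglogp_old, neglogp_new, advantages, cliprange=0.2):
    """Clipped PPO surrogate objective (to be maximized)."""
    ratio = np.exp(neglogp_old - neglogp_new)            # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - cliprange, 1.0 + cliprange)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))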