pytorch / examples

A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc.
https://pytorch.org/examples
BSD 3-Clause "New" or "Revised" License

RL Algorithms #14

Closed soumith closed 7 years ago

soumith commented 7 years ago

@ludc and @korymath are interested in building out some RL algorithms and doing OpenAI Gym integration.

Kory, judging from his repo (https://github.com/korymath/examples/tree/master/rl), hasn't yet started on anything concrete.

If each of you declares here what you are doing before you start developing it, then I think the other person can avoid overlap.

ludc commented 7 years ago

Hello,

I just discovered pytorch yesterday, so I still have to go in depth, but my first intention was to evaluate whether I can recode the rltorch package with pytorch. Basically, rltorch is very simple but more general than OpenAI Gym, since it allows one to decompose any problem into environment, sensor, feedback and policy, and thus can also be used for other problems (like supervised classification with RL, etc.). Making an OpenAI Gym wrapper is then very easy. In that case, one way to test the platform is to reimplement one of the environments (e.g. CartPole) and to compare the implemented environment with the OpenAI Gym one. I think this could be done in a few days.

The second point concerns the implementation of policy gradient algorithms. I think that the 'dpnn' mechanism is very nice. I suppose it can be implemented in pytorch by extending the 'Container' and/or 'Module' classes to incorporate a 'reinforce' method (but I have to check to be sure).
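
For instance, a rough sketch of what such an interface could look like in pytorch (the StochasticModule name and the way the reward is stored are my own assumptions, not an existing API):

import torch.nn as nn

class StochasticModule(nn.Module):
    # hypothetical sketch: a module that samples during forward() and
    # receives a reward through reinforce() to estimate its gradient
    def __init__(self):
        super(StochasticModule, self).__init__()
        self.reward = None

    def reinforce(self, reward):
        # store the reward; the backward pass would use it to form the
        # REINFORCE estimator based on log P(sample | input) * reward
        self.reward = reward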

In terms of algorithms, I would like to start with: policy gradient, recurrent policy gradient, predictive policy (see rltorch), ucbpolicy (for online learning), and imitation policy (supervised).

What do you think?

apaszke commented 7 years ago

Hey,

So I've also been playing with some RL lately, and I started decomposing the code into different chunks and thinking about how they would fit into pytorch. Right now I've ended up with something quite closely following the design of rltorch. However, if this framework works well in all/most RL cases, we could go a step further and also define an RLTrainer class.

Right now we only have a basic torch.utils.trainer.Trainer class, but it's definitely too general for this. I like the decomposition of the problem into World, Sensor and Policy. I only don't understand what Feedback is for (is it a generalization of a reward? do you ever give non-scalar rewards?). I'd propose that we start posting code samples. My initial design is the following:

class Sense(object):
    def __init__(self, env):
        self.env = env

    def observe(self):
        raise NotImplementedError

class Environment(object):
    def __init__(self):
        self.actions = set()

    def take_action(self, action):
        # update environment state and return the resulting feedback
        raise NotImplementedError

    # Optionally a number of senses you can use to observe the world

class Agent(object):

    def __init__(self, env):
        # initialize internal models
        # gather senses from env
        pass

    def forward(self):
        # use envs to observe the environment state
        # create input for the internal models
        # predict action / choose random one
        pass

    def backward(self, feedback):
        # generate gradients for internal models
        # can use experience replay instead of feedback
        pass

    def new_session(self):
        # clear any saved state e.g. screen history
        pass

# I'm omitting all the plugin-calling code for clarity
class RLTrainer(Trainer):

    def __init__(self, optimizer, env, agent, session_length=1000):
        self.optimizer = optimizer
        self.env = env
        self.agent = agent
        self.session_length = session_length

    def train(self):
        self.agent.new_session()
        for i in range(self.session_length):
            action = self.agent.forward()
            feedback = self.env.take_action(action)
            self.optimizer.zero_grad()
            self.agent.backward(feedback)
            self.optimizer.step()

An example implementation:

import torch
from random import random

class GameEnvironment(Environment):
    def __init__(self):
        self.actions = {MOVE_FORWARD, MOVE_BACKWARD, SHOOT}
        self.game = ... # initialize the game engine

    def take_action(self, action):
        return self.game.update(action)  # assuming the game engine returns the feedback/reward

    def screen_buffer(self):
        return self._ScreenBuffer(self)

    class _ScreenBuffer(Sense):
        def __init__(self, env):
            self.env = env

        def observe(self):
            return torch.FloatTensor(self.env.game.screen_buffer)

        def size(self):
            return torch.Size([1, 3, 100, 100])

class DQNAgent(Agent):
    def __init__(self, env):
        self.actions = list(env.actions)  # list so actions can be indexed by action_idx
        self.screen_buffer = env.screen_buffer()
        self.dqn = DQN(self.screen_buffer.size(), len(self.actions))
        self.last_frames = ...
        self.replay_memory = ...

    def forward(self):
        self.remember_frame(self.screen_buffer.observe()) # updates self.last_frames

        if self._should_pick_random():
            action_idx = int(random() * len(self.actions))
        else:
            action_idx = self._predict()

        self.last_action = self.actions[action_idx]
        return self.last_action

    def _should_pick_random(self):
        return random() > threshold  # threshold: exploration schedule, left undefined in this sketch

    def _predict(self):
        output = self.dqn(self.last_frames)
        return output.max(1)[1].item()  # index of the highest-scoring action

    def backward(self, feedback):
        self.replay_memory.store(
                self.last_frames,
                self.last_action,
                feedback,
                self.screen_buffer.observe()
        )
        # sample from replay_memory and do backward on the dqn
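
For completeness, a rough sketch of how these pieces could be wired together (the SGD optimizer and learning rate are arbitrary choices, and it assumes DQN is an nn.Module so that parameters() exists):

import torch

env = GameEnvironment()
agent = DQNAgent(env)
optimizer = torch.optim.SGD(agent.dqn.parameters(), lr=1e-3)

trainer = RLTrainer(optimizer, env, agent, session_length=1000)
trainer.train()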

Also, could you please point me to the code of dpnn you're referring to? I haven't done any RL with Lua Torch, so I don't know what the design was.

ludc commented 7 years ago

OK, some differences between what you propose and rltorch:

I think it is interesting to have a separate class for defining the task to solve. Here is the code I imagine:

class Feedback(object):
    def __init__(self, env):
        self.env = env

    def feedback(self, env):
        raise NotImplementedError

    def finished(self, env):
        raise NotImplementedError

Concerning the finished method, I think that one clever thing would be to define a method that lists all authorized actions (or authorized domains for continuous RL). If this method returns an empty set, then the episode is finished. Other types of feedback can be defined (see https://github.com/ludc/rltorch/tree/master/torch/environments/classiclearning/classification).
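
For instance, a minimal hypothetical cart-pole-style feedback written against the class above (the env.pole_angle, env.max_angle and env.actions attributes are assumptions for the sake of the example):

class CartPoleFeedback(Feedback):
    # reward of +1 per step; the episode ends when the pole has fallen,
    # i.e. when no action is authorized anymore
    def feedback(self, env):
        return 1.0

    def possible_actions(self, env):
        if abs(env.pole_angle) > env.max_angle:
            return set()          # empty set => episode finished
        return env.actions

    def finished(self, env):
        return len(self.possible_actions(env)) == 0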

The second point concerns the definition of the agent (or policy). Since the agent can receive different types of feedback (at different time steps in the process), here is what I imagine:

class Agent(object):
    def new_episode(self, information):
        # 'information' contains additional data for the initialization of the agent
        pass

    def observe(self, observation):
        pass

    def feedback(self, feedback):
        # can be called at any point during the life of the agent, and multiple times
        pass

    def sample(self):
        # sample an action
        pass

    def end_episode(self, feedback):
        pass

    def reset(self):
        pass

This definition is very general and can then be instantiated for a particular 'gradient-based' agent (what you propose in your previous post).
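
As an illustration, a rough sketch of how a gradient-based agent could fill in this interface with a simple REINFORCE-style update (the model, optimizer and update rule are placeholders, not rltorch code):

import torch

class PolicyGradientAgent(object):
    def __init__(self, model, optimizer):
        self.model = model            # maps an observation tensor to action probabilities
        self.optimizer = optimizer
        self.reset()

    def new_episode(self, information=None):
        self.log_probs, self.rewards = [], []

    def observe(self, observation):
        self.last_observation = observation

    def feedback(self, feedback):
        self.rewards.append(feedback)

    def sample(self):
        probs = self.model(self.last_observation)
        action = torch.multinomial(probs, 1)
        self.log_probs.append(torch.log(probs[action]))
        return action

    def end_episode(self, feedback=None):
        # surrogate loss whose gradient is the policy gradient estimator
        loss = -sum(self.log_probs) * sum(self.rewards)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def reset(self):
        self.log_probs, self.rewards = [], []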

Concerning the RLTrainer, I have no preference....

I can write all these classes in 'real' python during the weekend, but it will be almost the same thing as the core classes of rltorch.

How do you want to proceed?

ludc commented 7 years ago

Concerning the dpnn approach, the idea is the following:

You make a model that includes stochastic layers (for example, the MultinomialLayer takes a discrete set of probabilities as input and outputs a one-hot vector by sampling from this distribution). The gradient can then be estimated by providing a "reward-like" feedback to this layer before the backward.

So, imagine you have a loss function, an input x and an output y to predict; you can do:


m1 = Linear(...)
m2 = SoftMax(...)
m3 = MultinomialLayer(...)
m4 = Linear(...)

a = m1:forward(x)
b = m2:forward(a)
c = m3:forward(b) -- here the sampling is made
d = m4:forward(c)

err = loss:forward(d, y)
-- the reward is provided to the stochastic module; it allows computing the derivative
-- of the term log P(sample | input) * reward, which will be backpropagated
m3:reinforce(-err)
delta = loss:backward(d, y)
delta = m4:backward(...)
delta = m3:backward(...)
delta = m2:backward(...)
delta = m1:backward(...)

When using it with reinforcement learning, you directly have a reward (no loss), so you start your backpropagation with an empty delta, but the idea remains the same. (see https://github.com/Element-Research/dpnn#nn.Reinforce)

Basically, the goal is to include stochastic modules in the computation graph, and this can be done by adding a reinforce method to these modules (but other ways can be imagined, I suppose).
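
A minimal pytorch counterpart of that idea, without any special reinforce method, is to build the surrogate term log P(sample|input) * reward by hand and backpropagate through it (the layer sizes and the reward value below are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 4), nn.Softmax(dim=-1))
x = torch.randn(1, 10)

probs = model(x)                            # distribution over 4 actions
action = torch.multinomial(probs, 1)        # the stochastic "layer": sample an action
log_prob = torch.log(probs.gather(1, action))

reward = 1.0                                # placeholder reward from the environment
loss = -log_prob * reward                   # gradient of this term is the REINFORCE estimator
loss.backward()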

korymath commented 7 years ago

Looks good, nice discussion guys.

Thanks for making the connection @soumith.

The way we did this in https://github.com/twitter/torch-twrl is a little bit different.

Here we have an agent defined by a learning update, a model and a policy. The agent is completely separate from the environment and from the monitoring code. This separation allows for building in modular chunks. You can see a nice visualization here: https://blog.twitter.com/2016/reinforcement-learning-for-torch-introducing-torch-twrl

I have been working on DDPG in pytorch, and will try to model it after the breakdown in the twrl package. Twrl was modelled after RLLab and torch-rl.

We should aim to enable capabilities expected by OpenAI Gym as it is the common test bed these days. I have been working on a simple continuous action space example with DDPG.

Not sure if this is helpful, but I figured that an easy implementation of a common, popular RL algorithm on OpenAI Gym would be the most effective example for RL on pytorch.
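
For reference, the interaction loop Gym expects, which whatever abstraction we settle on would ultimately have to drive (the environment id below is just an example of a continuous action space task, and the random action stands in for the DDPG policy):

import gym

env = gym.make('Pendulum-v0')
observation = env.reset()
done = False
while not done:
    action = env.action_space.sample()      # a DDPG actor would produce this instead
    observation, reward, done, info = env.step(action)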

ludc commented 7 years ago

Concerning the separation between agent, environment and monitoring code, I totally agree. Concerning enabling a strong (and easy) connection with OpenAI Gym, I agree as well.

So, I think that we just have to agree on simple core classes (actually, the way PG or other algorithms will be implemented is a totally separate problem), right?

ludc commented 7 years ago

Concerning the core classes (not agent/policy here), this is what I have in mind, considering the structure proposed by @apaszke:

class Sense(object):
    def __init__(self, env):
        pass

    # I think that it is important to pass the environment as an argument here, since the same sensor can be used on different copies of an environment
    def observe(self, env):
        raise NotImplementedError

    # A description of the space of the data returned by the sensor
    def sense_space(self):
        raise NotImplementedError

class World(object):
    def __init__(self):
        self.actions = set()

    def take_action(self, action):
        # update environment state. No feedback (see Feedback class)
        pass

    def reset(self, info):
        # reset the environment
        # info: additional information; can be used, for example, to generate mazes with different difficulty levels
        pass

    def clone(self):
        # clone the environment in its current state
        pass

class Feedback(object):
    def __init__(self, env):
        pass

    def feedback(self, env):
        # return the feedback associated with the env; here the feedback is provided when the env state is reached, i.e. r(s)
        pass

    def feedback_action(self, env, action):
        # return the feedback associated with the state if action 'action' is taken; here the feedback is computed before modifying the env, i.e. r(s, a)
        pass

    def finished(self, env):
        # the episode is finished; no more actions allowed
        pass

    def possible_actions(self, env):
        # the set of actions that can be applied
        pass

class GymEnvironment(gym.Env):
    # GymEnvironment is the adapter to openAI Gym

    def __init__(self, world, feedback, sense):
        pass

    # See the openai gym definition for the rest of the interface
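
Concretely, a rough sketch of how the adapter's reset/step could be implemented, assuming the classic gym.Env interface and the World/Feedback/Sense classes above:

import gym

class GymEnvironment(gym.Env):
    # exposes a (World, Feedback, Sense) triple through the gym API
    def __init__(self, world, feedback, sense):
        self.world, self.feedback, self.sense = world, feedback, sense

    def reset(self):
        self.world.reset(None)
        return self.sense.observe(self.world)

    def step(self, action):
        reward = self.feedback.feedback_action(self.world, action)
        self.world.take_action(action)
        observation = self.sense.observe(self.world)
        done = self.feedback.finished(self.world)
        return observation, reward, done, {}
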
apaszke commented 7 years ago

@ludc About the stochastic modules, it's actually not that simple. It could work, but you'd need to perform the work of :reinforce in backward and treat grad_output as err. The new nn isn't a standalone module; it relies heavily on the machinery of torch.autograd, which is designed for graphs of differentiable ops (they need to define forward and backward). You never call these backward methods yourself; you call backward only once, on a variable, and it bootstraps the whole gradient computation and passes control to the ExecutionEngine.
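
For concreteness, a minimal example of that calling convention (the tensors here are arbitrary):

import torch
from torch.autograd import Variable

x = Variable(torch.randn(5), requires_grad=True)
y = (x * 2).sum()     # build a graph of differentiable ops
y.backward()          # called once, on the output variable; autograd handles the rest
print(x.grad)         # gradients are accumulated on the leaves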

Some comments on the API you proposed:

If you feel like reimplementing rltorch or something similar so soon, then go on. I will have to work on some other stuff as well, so I can wait and review the changes if you wish.

@korymath I'll try to take a look at torch-twrl soon and see how it feels.