ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Using RNN for RL #43420

Open anirjoshi opened 4 months ago

anirjoshi commented 4 months ago

Description

I can see that RLlib supports the use of RNNs. It would be great to have an example that shows the use of RNNs for an environment in RLlib. I would like to implement an RNN for a custom-made environment, so an example showing this would give me a starting point that I can further customize. Thanks!


simonsays1980 commented 4 months ago

@anirjoshi Basically you just set "use_lstm": True in the model dictionary in AlgorithmConfig.training(). See here for a more elaborate example: https://github.com/ray-project/ray/blob/master/rllib/examples/custom_recurrent_rnn_tokenizer.py

You don't need the custom tokenizer shown there; just set "use_lstm": True.
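Roughly, something like the following (a minimal sketch only; PPO, CartPole-v1, and the numeric values are placeholders, not taken from your setup):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")      # placeholder env; plug in your own here
    .training(
        model={
            "use_lstm": True,        # auto-wrap the default model in an LSTM
            "lstm_cell_size": 256,   # hidden size of the LSTM (RLlib's default)
            "max_seq_len": 20,       # max sequence length for truncated BPTT
        }
    )
)

algo = config.build()
results = algo.train()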

anirjoshi commented 4 months ago

@simonsays1980 Thank you for your quick reply. This example seems to be for TensorFlow; it would be great to see a PyTorch example. Also, I am not very sure about the terminology, as I am not super familiar with RNNs. I just have an MDP defined, where a state is represented by a variable-size input and an action by a size-3 output of the RNN. How easy would it be to use RLlib for this? I was hoping to see an example of this.

anirjoshi commented 4 months ago

@simonsays1980 In particular, I have constructed the following environment as an example. Note that this environment has variable-size inputs!

import gymnasium as gym
from gymnasium.spaces import Discrete, MultiDiscrete, Sequence


class ModuloComputationEnv(gym.Env):
    """Environment in which an agent must learn to output mod 2, 3, 4 of the sum
    of all observations seen so far.

    Observations are variable-length sequences of integers, e.g. (1, 3, 4, 5).

    The action space is a vector of 3 values: the running sum % 2, % 3, and % 4.

    Rewards are r = -abs(self.ac1 - action[0]) - abs(self.ac2 - action[1])
    - abs(self.ac3 - action[2]) at every step.
    """

    def __init__(self, config):
        # The input sequence can contain any integers from 0 to 99.
        self.observation_space = Sequence(Discrete(100), seed=2)

        # The action is a vector of 3: [%2, %3, %4] of the running sum.
        self.action_space = MultiDiscrete([2, 3, 4])

        self.cur_obs = None

        # Tracks the episode length.
        self.episode_len = 0

        # Running sum modulo 2, 3, and 4, respectively.
        self.ac1 = 0
        self.ac2 = 0
        self.ac3 = 0

    def reset(self, *, seed=None, options=None):
        """Resets the episode and returns the initial observation of the new one."""
        super().reset(seed=seed)

        # Reset the episode length.
        self.episode_len = 0

        # Sample a random sequence from our observation space.
        self.cur_obs = self.observation_space.sample()

        # Take the sum of the initial observation ...
        sum_obs = sum(self.cur_obs)

        # ... and store its %2, %3, and %4.
        self.ac1 = sum_obs % 2
        self.ac2 = sum_obs % 3
        self.ac3 = sum_obs % 4

        # Return initial observation and (empty) info dict.
        return self.cur_obs, {}

    def step(self, action):
        """Takes a single step in the episode given `action`.

        Returns:
            New observation, reward, terminated-flag, truncated-flag, info-dict (empty).
        """
        # Terminate the episode after 10 steps.
        self.episode_len += 1
        truncated = False
        terminated = self.episode_len >= 10

        # The reward is the negative distance to the correct modulo values.
        reward = (
            abs(self.ac1 - action[0])
            + abs(self.ac2 - action[1])
            + abs(self.ac3 - action[2])
        )
        reward = -reward

        # Set a new observation (random sample).
        self.cur_obs = self.observation_space.sample()

        # Update the running %2, %3, and %4 values with the new observation's sum.
        new_sum = sum(self.cur_obs)
        self.ac1 = (new_sum + self.ac1) % 2
        self.ac2 = (new_sum + self.ac2) % 3
        self.ac3 = (new_sum + self.ac3) % 4

        return self.cur_obs, reward, terminated, truncated, {}
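For reference, here is a quick standalone rollout (no RLlib involved) that I use to sanity-check the environment; the empty config dict is just a placeholder:

# Sanity-check the environment with random actions, independent of RLlib.
env = ModuloComputationEnv(config={})
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"obs_len={len(obs)} action={action} reward={reward}")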

I would like to use RLlib to train some RL algorithm on it. Is this possible? Some help in this regard would be great!

simonsays1980 commented 3 months ago

@anirjoshi Your example environment should work with RLlib as long as it implements the gymnasium.Env interface. The use_lstm key works for both TF and Torch (the framework can be set via AlgorithmConfig.framework("torch"), which sets it to Torch).

You can find here all the possible model settings in RLlib, and here an overview of our auto-LSTM wrappers that are triggered by the use_lstm setting.

Regarding the example I linked above: it works for both TF and Torch. You can run it from the command line and pass in the argument --framework=torch. I advise you to read carefully through the code and documentation to build your understanding of how to use RLlib for your experiments.
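Putting the pieces together, a rough PyTorch training sketch for the environment above could look like the following. This is a sketch only: PPO, the LSTM sizes, and the number of iterations are placeholders, and the variable-length Sequence observation space may need additional preprocessing (e.g. padding or flattening) before the default models accept it.

import ray
from ray.rllib.algorithms.ppo import PPOConfig

ray.init()

config = (
    PPOConfig()
    .environment(ModuloComputationEnv, env_config={})  # the env class from above
    .framework("torch")                                # use PyTorch instead of TF
    .training(
        model={
            "use_lstm": True,       # wrap the default torch model in an LSTM
            "lstm_cell_size": 64,   # placeholder size
            "max_seq_len": 10,      # episodes terminate after 10 steps anyway
        }
    )
)

algo = config.build()
for _ in range(5):
    results = algo.train()
    # The exact result keys depend on the RLlib version; .get() avoids a KeyError.
    print(results.get("episode_reward_mean"))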