ray-project / ray


[RLlib] LSTM Model Cannot be Used with Tuple Observation Spaces #42480

Open GoodarzMehr opened 8 months ago

GoodarzMehr commented 8 months ago

What happened + What you expected to happen

I have been using RLlib with a multi-agent CARLA environment (adapted from this integration) where I have a tuple observation space:

Tuple((Box(low=-1.001, high=1.001, shape=(200, 200, 5)),
       Box(low=-1.001, high=1.001, shape=(12, 3))))

Training with PPO or SAC works without any errors when using the following model configuration:

model:
  fcnet_hiddens: [256, 256]
  dim: 200
  conv_filters: [
    [16, [3, 3], 2],
    [32, [3, 3], 2],
    [32, [3, 3], 2],
    [64, [3, 3], 2],
    [64, [3, 3], 2],
    [128, [3, 3], 2]
  ]
  post_fcnet_hiddens: [256]

However, when I add an LSTM to the model, i.e. change it to this (and use either PPO or RNNSAC):

model:
  fcnet_hiddens: [256, 256]
  dim: 200
  conv_filters: [
    [16, [3, 3], 2],
    [32, [3, 3], 2],
    [32, [3, 3], 2],
    [64, [3, 3], 2],
    [64, [3, 3], 2],
    [128, [3, 3], 2]
  ]
  post_fcnet_hiddens: [256]
  use_lstm: True
  lstm_use_prev_action: True
  lstm_use_prev_reward: True
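
(For reference, the model block above corresponds roughly to the following programmatic configuration. This is only a sketch assuming the standard PPOConfig API, not the exact code I run, and it assumes 'carla' has already been registered via register_env as in the script below.)

from ray.rllib.algorithms.ppo import PPOConfig

# Sketch only: rough programmatic equivalent of the YAML model block above.
# Assumes the "carla" env has been registered with register_env.
config = (
    PPOConfig()
    .environment(env="carla")
    .framework("torch")
    .training(
        model={
            "fcnet_hiddens": [256, 256],
            "dim": 200,
            "conv_filters": [
                [16, [3, 3], 2],
                [32, [3, 3], 2],
                [32, [3, 3], 2],
                [64, [3, 3], 2],
                [64, [3, 3], 2],
                [128, [3, 3], 2],
            ],
            "post_fcnet_hiddens": [256],
            "use_lstm": True,
            "lstm_use_prev_action": True,
            "lstm_use_prev_reward": True,
        }
    )
)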

I get this error:

2024-01-18 00:03:11,078 ERROR tune_controller.py:1374 -- Trial task failed for trial PPO_carla_bbab8_00000
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2626, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=9896, ip=172.30.44.208, actor_id=71a8fd8c61c75a938c0a09a101000000, repr=PPO)
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/evaluation/worker_set.py", line 229, in _setup
    self.add_workers(
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/evaluation/worker_set.py", line 616, in add_workers
    raise result.get()
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/utils/actor_manager.py", line 487, in __fetch_result
    result = ray.get(r)
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=10058, ip=172.30.44.208, actor_id=fc75864329e4a284f6ec2a0401000000, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f6ef2f4b340>)
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 535, in __init__
    self._update_policy_map(policy_dict=self.policy_dict)
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 1746, in _update_policy_map
    self._build_policy_map(
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 1857, in _build_policy_map
    new_policy = create_policy_for_framework(
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/utils/policy.py", line 141, in create_policy_for_framework
    return policy_class(observation_space, action_space, merged_config)
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/algorithms/ppo/ppo_torch_policy.py", line 64, in __init__
    self._initialize_loss_from_dummy_batch()
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/policy/policy.py", line 1430, in _initialize_loss_from_dummy_batch
    actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/policy/torch_policy_v2.py", line 572, in compute_actions_from_input_dict
    return self._compute_action_helper(
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/utils/threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/policy/torch_policy_v2.py", line 1293, in _compute_action_helper
    dist_inputs, state_out = self.model(input_dict, state_batches, seq_lens)
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/models/modelv2.py", line 266, in __call__
    res = self.forward(restored, state or [], seq_lens)
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/models/torch/recurrent_net.py", line 256, in forward
    return super().forward(input_dict, state, seq_lens)
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/models/torch/recurrent_net.py", line 98, in forward
    output, new_state = self.forward_rnn(inputs, state, seq_lens)
  File "/usr/local/lib/python3.8/dist-packages/ray/rllib/models/torch/recurrent_net.py", line 271, in forward_rnn
    self._features, [h, c] = self.lstm(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/rnn.py", line 689, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/rnn.py", line 632, in check_forward_args
    self.check_input(input, batch_sizes)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/rnn.py", line 205, in check_input
    raise RuntimeError(
RuntimeError: input.size(-1) must be equal to input_size. Expected 1410, got 258

1408 seems to be the concatenated output of the CNN (128x3x3 = 1152) and the FC layers (256), with the previous action and reward presumably accounting for the extra 2 in the expected 1410. My understanding, however, is that this concatenated output has to go through the post-FC hidden layer (256) before being passed to the LSTM, which would explain the 258 the LSTM actually receives. I'm not sure exactly where the dimensions go wrong, and I would appreciate your help in resolving this.
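
Here is my rough accounting of where I think the two numbers come from (an assumption on my part, not verified against the RLlib code; it assumes the previous action and reward each add one feature):

# Rough accounting of the size mismatch (my assumptions only):
cnn_out = 128 * 3 * 3               # 1152, flattened output of the last conv filter
fcnet_out = 256                     # last fcnet_hiddens layer for the (12, 3) branch
concat_size = cnn_out + fcnet_out   # 1408
expected = concat_size + 1 + 1      # 1410, LSTM input_size incl. prev. action and reward
received = 256 + 1 + 1              # 258, post_fcnet_hiddens output + prev. action and reward
print(expected, received)           # 1410 258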

Versions / Dependencies

Ray 2.9.0
Torch 1.10.1+cu113
Python 3.8.10

Reproduction script

I'm using the following training script:

import os
import ray
import yaml
import time
import argparse

from tensorboard import program

from ray import air, tune

from ray.tune.registry import register_env

from carla_env import CarlaEnv

argparser = argparse.ArgumentParser(description='CoPeRL Training Implementation.')

argparser.add_argument('config', help='configuration file')
argparser.add_argument('-d', '--directory',
                       metavar='D',
                       default='/home/coperl/ray_results',
                       help='directory to save the results (default: /home/coperl/ray_results)')
argparser.add_argument('-n', '--name',
                       metavar='N',
                       default='ppo_experiment',
                       help='name of the experiment (default: ppo_experiment)')
argparser.add_argument('--restore',
                       action='store_true',
                       default=False,
                       help='restore the specified experiment (default: False)')
argparser.add_argument('--tb',
                       action='store_true',
                       default=False,
                       help='activate tensorboard (default: False)')

args = argparser.parse_args()

def parse_config(args):
    '''
    Parse the configuration file.

    Args:
        args: command line arguments.

    Return:
        config: configuration dictionary.
    '''
    with open(args.config) as f:
        config = yaml.load(f, Loader=yaml.FullLoader)

    return config

def launch_tensorboard(logdir, host='localhost', port='6006'):
    '''
    Launch TensorBoard.

    Args:
        logdir: directory of the saved results.
        host: host address.
        port: port number.

    Return:

    '''
    tb = program.TensorBoard()
    tb.configure(argv=[None, '--logdir', logdir, '--host', host, '--port', port])
    url = tb.launch()

def env_creator(env_config):
    '''
    Create Gymnasium-like environment.

    Args:
        env_config: configuration passed to the environment.

    Return:
        env: environment object.
    '''
    return CarlaEnv(env_config)

def run(args):
    '''
    Run Ray Tuner.

    Args:
        args: command line arguments.

    Return:

    '''
    try:
        ray.init(num_cpus=12, num_gpus=2)

        register_env('carla', env_creator)

        os.system('nvidia-smi')

        if not args.restore:
            tuner = tune.Tuner(
                'PPO',
                run_config=air.RunConfig(
                    name=args.name,
                    storage_path=args.directory,
                    checkpoint_config=air.CheckpointConfig(
                        num_to_keep=2,
                        checkpoint_frequency=1,
                        checkpoint_at_end=True
                    ),
                    stop={'training_iteration': 8192},
                    verbose=2
                ),
                param_space=args.config,
            )
        else:
            tuner = tune.Tuner.restore(os.path.join(args.directory, args.name), 'PPO', resume_errored=True)

        result = tuner.fit().get_best_result()

        print(result)

    except Exception as e:
        print(e)
    finally:
        ray.shutdown()
        time.sleep(10.0)

def main():
    args.config = parse_config(args)

    if args.tb:
        launch_tensorboard(logdir=os.path.join(args.directory, args.name))

    run(args)

if __name__ == '__main__':
    try:
        main()
    except KeyboardInterrupt:
        ray.shutdown()
    finally:
        print('Done.')

with the following configuration file:

framework: 'torch'

env: 'carla'
disable_env_checking: True

num_workers: 1
num_gpus: 1
num_cpus_per_worker: 8
num_gpus_per_worker: 1

train_batch_size: 1024

log_level: 'DEBUG'

ignore_worker_failures: True
restart_failed_sub_environments: False

checkpoint_at_end: True
export_native_model_files: True

keep_per_episode_custom_metrics: True

model:
  fcnet_hiddens: [256, 256]
  dim: 200
  conv_filters: [
    [16, [3, 3], 2],
    [32, [3, 3], 2],
    [32, [3, 3], 2],
    [64, [3, 3], 2],
    [64, [3, 3], 2],
    [128, [3, 3], 2]
  ]
  post_fcnet_hiddens: [256]
  use_lstm: True
  lstm_use_prev_action: True
  lstm_use_prev_reward: True

The CarlaEnv environment I'm using is not publicly available, though I think the issue can be reproduced with a dummy environment having the same observation space.
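
For example, a minimal dummy environment with the same observation space might look like the sketch below (the action space here is just a placeholder, not what my real environment uses):

import numpy as np
import gymnasium as gym
from gymnasium.spaces import Box, Tuple


class DummyTupleEnv(gym.Env):
    """Placeholder environment with the same Tuple observation space."""

    def __init__(self, env_config=None):
        self.observation_space = Tuple((
            Box(low=-1.001, high=1.001, shape=(200, 200, 5), dtype=np.float32),
            Box(low=-1.001, high=1.001, shape=(12, 3), dtype=np.float32),
        ))
        # Placeholder action space; my real environment differs.
        self.action_space = Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        # Random observations and zero reward are enough to trigger model building.
        return self.observation_space.sample(), 0.0, False, False, {}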

Issue Severity

High: It blocks me from completing my task.

GoodarzMehr commented 8 months ago

Update: I think the issue can be fixed (for PPO at least) by changing line 175 of rllib/models/torch/complex_input_net.py to this:

# Use the size of the last post-FC hidden layer (if any) as the model's output size.
post_fcnet_hiddens = model_config.get("post_fcnet_hiddens", [])

if post_fcnet_hiddens:
    self.num_outputs = post_fcnet_hiddens[-1]
else:
    self.num_outputs = concat_size

since concat_size is the size of the model output before the final FC hidden layers. For RNNSAC, what seems to have worked, in addition to the change above, was adding this at line 96 of rllib/algorithms/sac/rnnsac_torch_model.py:

if actions is None:
    actions = model_out['prev_actions']

and changing line 370 of rllib/algorithms/sac/rnnsac_torch_policy.py to

q_tp1, _ = target_model.get_q_values(

That said, even with small replay buffer sizes, RNNSAC seems to gobble up so much RAM that it causes workers to quit due to memory pressure. I would appreciate it if someone could verify these changes.

simonsays1980 commented 5 months ago

@sven1977 We should discuss if and how we could support this in the new stack.