rlworkgroup / garage

A toolkit for reproducible reinforcement learning research.
MIT License

Discrete action-space for Meta-RL algorithms #2214

Open maniset opened 3 years ago

maniset commented 3 years ago

Hello,

Thanks for this great library.

I have a question. I want to use RL2, specifically the RL2TRPO algorithm, with a discrete action space. However, it seems that the current implementation doesn't support discrete action spaces, and I think this is also true for MAML. Is there any way to use a discrete action space with the meta-RL algorithms (RL2 and MAML)?

I would appreciate your help on this matter.

Sincerely

ryanjulian commented 3 years ago

@maniset I'm not sure that the current RL2 and MAML implementations don't support discrete action spaces. Can you provide a minimal code snippet and error message that replicates the failure?

maniset commented 3 years ago

Thanks for your response.

I have implemented a new environment, and it works without any problem when I use a continuous action space (with RL2TRPO). However, when I switch to a discrete space in the same environment, I get the following error:

TypeError                                 Traceback (most recent call last)
~/Code/Meta-RL/agent.py in <module>
     67 
     68 
---> 69 rl2_trpo()

~/miniconda3/lib/python3.8/site-packages/garage/experiment/experiment.py in __call__(self, *args, **kwargs)
    367         else:
    368             ctxt = self._make_context(self._get_options(*args), **kwargs)
--> 369             result = self.function(ctxt, **kwargs)
    370             logger.remove_all()
    371             logger.pop_prefix()

~/Code/Meta-RL/agent.py in rl2_trpo(ctxt, seed)
     37         tasks = task_sampler.SetTaskSampler(meta_env, wrapper=lambda env, _: RL2Env(meta_env()))
     38 
---> 39         env_spec = RL2Env(meta_env()).spec
     40 
     41         # policy = GaussianGRUPolicy(name='policy', hidden_dim=hp['hidden_dim'], env_spec=env_spec, state_include_action=False)

~/miniconda3/lib/python3.8/site-packages/garage/tf/algos/rl2.py in __init__(self, env)
     33         super().__init__(env)
     34 
---> 35         self._observation_space = self._create_rl2_obs_space()
     36         self._spec = EnvSpec(
     37             action_space=self.action_space,

~/miniconda3/lib/python3.8/site-packages/garage/tf/algos/rl2.py in _create_rl2_obs_space(self)
    105         obs_flat_dim = np.prod(self._env.observation_space.shape)
    106         action_flat_dim = np.prod(self._env.action_space.shape)
--> 107         return akro.Box(low=-np.inf,
    108                         high=np.inf,
    109                         shape=(obs_flat_dim + action_flat_dim + 1 + 1, ))

~/miniconda3/lib/python3.8/site-packages/gym/spaces/box.py in __init__(self, low, high, shape, dtype)
     41 
     42         if np.isscalar(low):
---> 43             low = np.full(shape, low, dtype=dtype)
     44 
     45         if np.isscalar(high):

~/miniconda3/lib/python3.8/site-packages/numpy/core/numeric.py in full(shape, fill_value, dtype, order)
    312     if dtype is None:
    313         dtype = array(fill_value).dtype
--> 314     a = empty(shape, dtype, order)
    315     multiarray.copyto(a, fill_value, casting='unsafe')
    316     return a

TypeError: 'numpy.float64' object cannot be interpreted as an integer

I think it is related to:

action_flat_dim = np.prod(self._env.action_space.shape)

in rl2.py, because if I set this manually to an integer, that error goes away, but another line in rl2.py gives a similar error:

first_obs = np.concatenate([first_obs, np.zeros(self._env.action_space.shape), [0], [0]])

The error is:

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 0 dimension(s)

If I also change this to a list (e.g. [1]), then npo.py gives an error at:

actions = [self._env_spec.action_space.flatten_n(act) for act in episodes.actions_list
…

I believe the problem is related to self._env.action_space.shape. It only works with akro.Box(), and if I change the action space to akro.Discrete(), RL2TRPO no longer works.
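
To show concretely what I think is happening, here is a small check (assuming a Discrete space reports an empty shape, as gym's does):

import numpy as np
from gym import spaces

box = spaces.Box(low=-1.0, high=1.0, shape=(3,))
disc = spaces.Discrete(4)

print(np.prod(box.shape))   # 3 (an integer) -- fine as a shape entry
print(np.prod(disc.shape))  # 1.0 (numpy.float64), because disc.shape is ()
# That float64 ends up in the tuple rl2.py passes to akro.Box(shape=...),
# which is why np.full() complains it cannot be interpreted as an integer.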

For MAML, I have another problem: garage.torch.policies only has one categorical policy, which uses a CNN and which I can't use in my environment. I need something like the CategoricalMLPPolicy that exists in TensorFlow.

Thanks for your help.

RicardoLunaG commented 3 years ago

I have the same issue; it seems that MAML does not work with DeterministicMLPPolicy. Did you find any workaround?

ryanjulian commented 3 years ago

@avnishn has worked with MAML recently, he may be able to shed some light. @krzentner has worked extensively with flattening/unflattening and might have a more global view of what is going on.

Generally, if some primitive you'd like (e.g. "CategoricalMLPPolicy in TensorFlow") is missing, it should be pretty easy to implement by looking at the existing primitives for guidance. We welcome PRs with these kinds of contributions.
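
For reference, a rough, untested sketch of what such a primitive could look like, using plain PyTorch only (the exact garage.torch base class and method signatures may differ, so treat this as a starting point rather than a drop-in garage policy):

from torch import nn
from torch.distributions import Categorical

class CategoricalMLPPolicy(nn.Module):
    """MLP policy for discrete action spaces, parameterizing a Categorical."""

    def __init__(self, obs_dim, n_actions, hidden_sizes=(32, 32)):
        super().__init__()
        layers = []
        last_dim = obs_dim
        for size in hidden_sizes:
            layers += [nn.Linear(last_dim, size), nn.Tanh()]
            last_dim = size
        layers.append(nn.Linear(last_dim, n_actions))
        self._module = nn.Sequential(*layers)

    def forward(self, observations):
        # garage's torch StochasticPolicy (as I understand it) returns a
        # (torch.distributions.Distribution, info dict) pair from forward().
        logits = self._module(observations)
        return Categorical(logits=logits), {}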

avnishn commented 3 years ago

Hi @RicardoLunaG, can you give me some more details so that I can reproduce your bug?

Can you post the launcher that you used with MAML and the DeterministicMLPPolicy in a GitHub gist?

Thanks, @avnishn

krzentner commented 3 years ago

For RL2, the issue here is really that the existing RL2 environment wrapper assumes the inner environment has continuous / box observation and action spaces of only one dimension. Optimally it would use a Dict observation space instead, but in any case it would need custom policies to handle the original observation and the injected information from the wrapper.
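
As a rough sketch of a smaller workaround (not how the wrapper is written today, and assuming akro spaces expose flat_dim and a flat/one-hot encoding for Discrete, as the rest of garage relies on), the wrapper could size and fill its observation from the flattened spaces instead of from .shape:

import akro
import numpy as np

def rl2_obs_space(env):
    # Flat Box holding [obs, flattened last action, last reward, terminal].
    obs_flat_dim = env.observation_space.flat_dim
    act_flat_dim = env.action_space.flat_dim  # n for Discrete, prod(shape) for Box
    return akro.Box(low=-np.inf,
                    high=np.inf,
                    shape=(obs_flat_dim + act_flat_dim + 1 + 1, ))

def rl2_first_obs(env, first_obs):
    # There is no previous action at the start of an episode, so use a zero
    # vector of the flattened action size instead of np.zeros(action_space.shape).
    return np.concatenate(
        [first_obs, np.zeros(env.action_space.flat_dim), [0], [0]])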

I don't know why MAML wouldn't work. AFAICT our MAML implementation should work with any garage.torch.StochasticPolicy.