[rllib] action from policy with Tuple action space has wrong shape #10516

Open · rusu24edward opened this issue 3 years ago

rusu24edward commented 3 years ago

What is the problem?

I have an environment with a Tuple action space. Training works fine, but when I try to demonstrate the learned policy, agent.compute_action returns actions of the wrong shape: as in #3048, I get 2d arrays instead of the expected 1d arrays.
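For illustration, with the Tuple((Discrete(2), Box(low=0, high=1, shape=(2,)))) action space from the script below, the mismatch looks like this (the particular values are made up):

# Expected by env.step: the Box component is a 1d array of shape (2,)
(1, np.array([0, 1]))

# Actually returned by agent.compute_action(obs): an extra batch dimension,
# so the Box component is a 2d array of shape (1, 2)
(1, np.array([[0, 1]]))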

Ray version and other system information (Python version, TensorFlow version, OS):

- Ray: 0.8.5
- Python: 3.7.7
- TensorFlow: 2.3.0
- OS: macOS 10.14

Reproduction (REQUIRED)

# Dummy test case

import gym
from gym.spaces import Tuple, Box, Discrete
import numpy as np

class TupleCorridorEnv(gym.Env):
    def __init__(self, config=None):  # config dict is unused in this dummy env
        self.size = 5

        self.observation_space = Box(low=0, high=self.size-1, shape=(2,), dtype=np.int)
        self.action_space = Tuple((Discrete(2), Box(low=0, high=1, shape=(2,), dtype=np.int)))

    def reset(self):
        self.num_steps = 0
        self.pos = np.array([0, 0])
        return self.pos

    def step(self, action):
        self.num_steps += 1
        movement = action[1]  # The Box component of the (Discrete, Box) action tuple
        # Clip so the agent cannot overshoot the corridor's end and leave
        # the declared observation space.
        self.pos = np.minimum(self.pos + movement, self.size - 1)
        if self.pos[0] == self.size - 1 and self.pos[1] == self.size - 1:
            return self.pos, 1, True, {}
        if self.num_steps >= 10:
            return self.pos, -1, True, {}
        return self.pos, 0, False, {}

ray_tune = {
    'run_or_experiment': 'PG',
    'checkpoint_at_end': True,
    'stop': {
        'episodes_total': 2,
    },
    'config': {
        'env': TupleCorridorEnv,
        'env_config': {},
    }
}

import ray
from ray import tune
ray.init()
tune.run(**ray_tune)

from ray.rllib.agents.registry import get_agent_class

alg = get_agent_class('PG')  # look up the PG Trainer class
agent = alg(
    env=TupleCorridorEnv,
    config={},
)
env = TupleCorridorEnv()

# agent.restore(...), not needed to reproduce the error
obs = env.reset()
while True:
    action = agent.compute_action(obs) # Get the action

    print('\nAction is: ')
    print(action)
    print('\n')

    obs, reward, done, info = env.step(action)
    if done:
        break

ray.shutdown()

Notice that the printed action is a tuple whose second element is a 2d array instead of a 1d array, just as in #3048. This only appears to happen via agent.compute_action(obs), not during training.
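As a stopgap, the extra batch dimension can be squeezed off before stepping the environment. This is only a sketch of a workaround, not a fix; unbatch_action is a made-up helper, not an RLlib API:

import numpy as np

def unbatch_action(action):
    # Hypothetical helper: drop the leading size-1 batch axis that
    # compute_action leaves on the array components of a Tuple action.
    return tuple(
        np.squeeze(a, axis=0) if isinstance(a, np.ndarray) and a.ndim > 1 else a
        for a in action
    )

action = unbatch_action(agent.compute_action(obs))
obs, reward, done, info = env.step(action)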

GonzagueHenri commented 3 years ago

Hello,

I observed the same issue with a Tuple Gym environment on Ray 1.0.0 using the TensorFlow framework. The failure surfaces in compute_actions: tf_policy errors while trying to get the shape of fetched. The error does not appear when using PyTorch.

Reverting to 0.8.5 makes it possible to train an agent with TensorFlow (thanks, Edward!).
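For anyone on 1.0.0, here is a minimal sketch of switching the trainer to PyTorch via the standard framework config key (I have only observed that this avoids the error; I have not verified it for every Tuple space):

config = {
    'env': TupleCorridorEnv,
    'env_config': {},
    'framework': 'torch',  # 'tf' is the default on Ray 1.0.0 and hits the error
}
tune.run('PG', config=config, stop={'episodes_total': 2})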