mit-acl / gym-collision-avoidance


env.step(action) list size #6

Closed · krishna-bala closed this issue 3 years ago

krishna-bala commented 3 years ago

I'm attempting to train a new policy in the "CollisionAvoidance-v0" env using the A2C model from stable-baselines.

Following the documentation's instructions, I created the following files:

import numpy as np

# assuming the LearningPolicy base class lives in the repo's policies module
from gym_collision_avoidance.envs.policies.LearningPolicy import LearningPolicy


class LearningPolicyA2C(LearningPolicy):
    """ The A2C-CADRL policy while it's still being trained
    """
    def __init__(self):
        LearningPolicy.__init__(self)
        # self.possible_actions = network.Actions()

    def external_action_to_action(self, agent, external_action):
        """ Convert the discrete external_action into an action for this environment using properties about the agent.

        Args:
            agent (:class:`~gym_collision_avoidance.envs.agent.Agent`): the agent who has this policy
            external_action (np.ndarray): continuous [speed_fraction, heading_fraction]
                output directly from the network (note: not a discrete action, despite
                the original template docstring)

        Returns:
            [speed, heading_change] command

        """
        heading_change = agent.max_heading_change*(2.*external_action[1] - 1.)
        speed = agent.pref_speed * external_action[0]
        actions = np.array([speed, heading_change])
        # raw_action = self.possible_actions.actions[int(external_action)]
        # action = np.array([agent.pref_speed*raw_action[0], raw_action[1]])
        return actions

class Train(Config):
    def __init__(self):
        self.MAX_NUM_AGENTS_IN_ENVIRONMENT = 2
        self.MAX_NUM_AGENTS_TO_SIM = 2
        Config.__init__(self)
        self.TRAIN_SINGLE_AGENT = True
        self.TEST_CASE_ARGS['policy_to_ensure'] = 'learning_A2C'
        self.TEST_CASE_ARGS['policies'] = ['learning_A2C', 'RVO']
        self.TEST_CASE_ARGS['policy_distr'] = [0.5, 0.5] # Added to prevent error 'assert(len(policies)==len(policy_distr))' in test_cases.py, occurs during env.reset()


I've had trouble getting model.learn() from stable-baselines to work, so in the meantime I am simply running a random agent that samples an action from the action space:

import os
os.environ['GYM_CONFIG_CLASS'] = 'Train'

import gym
from gym_collision_avoidance.envs import Config
import gym_collision_avoidance.envs.test_cases as tc
from gym_collision_avoidance.experiments.src.env_utils import create_env
from stable_baselines.common.policies import MlpPolicy
#from stable_baselines.common import make_vec_env
from stable_baselines import A2C
from stable_baselines.common.env_checker import check_env

# env: a VecEnv wrapper around the CollisionAvoidanceEnv
# one_env: an actual CollisionAvoidanceEnv class (the unwrapped version of the first env in the VecEnv)
env, one_env = create_env()

# check_env(env, warn=True)

model = A2C(MlpPolicy, env, verbose=1)
# model.learn(total_timesteps=1000)

# The reset method is called at the beginning of an episode
obs = env.reset()

num_episodes = 1000  # note: each loop iteration below is one env step, not a full episode

for i in range(num_episodes):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()

When I debug the code, I run into the following error:

action = env.action_space.sample() produces a (2,) array of actions as expected. However, during env.step(action), the method self._take_action(actions, dt) in collision_avoidance_env.py is called, and that is where I run into an error:

for agent_index, agent in enumerate(self.agents):
    if agent.is_done:
        continue
    elif agent.policy.is_external:
        all_actions[agent_index, :] = agent.policy.external_action_to_action(agent, actions[agent_index])
    else:
        dict_obs = self.observation[agent_index]
        all_actions[agent_index, :] = agent.policy.find_next_action(dict_obs, self.agents, agent_index)

The loop passes actions[agent_index] to agent.policy.external_action_to_action(). However, my actions variable is a (2,) array containing the heading and speed for my single agent, so only one element (the delta heading angle) gets passed as the action (and this would raise an IndexError if more than 2 agents were defined in class Train() of config.py). A small illustration of this indexing follows the snippet below.

Am I implementing a portion of the code incorrectly, or should I change the _take_action(actions, dt) method in collision_avoidance_env.py so that it passes the full (2,) array of actions to the external_action_to_action() method? (see the change below)

    elif agent.policy.is_external:
        all_actions[agent_index, :] = agent.policy.external_action_to_action(agent, actions)
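
To illustrate with made-up numbers why only a single element reaches external_action_to_action() when a bare (2,) array is passed in:

import numpy as np

# hypothetical sampled action for my single agent (2 elements)
actions = np.array([0.8, 0.3])

# inside _take_action(), each agent receives actions[agent_index]:
actions[0]   # -> 0.8, a single scalar rather than the full 2-element action
actions[1]   # -> 0.3
# actions[2] would raise an IndexError if a third agent were defined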
mfe7 commented 3 years ago

Hi @krishna-bala, this all looks nice and makes sense. I suggest you change the action passed to the env in trainA2C.py to be a dictionary of actions, where the key is the id of the agent that you'd like to take that random action, e.g., action = {0: [spd, heading]}.

krishna-bala commented 3 years ago

Hi @mfe7 -- I just noticed that mistake as well after comparing my code to example.py.

I modified my code to instantiate actions as a dict, which solves the error from self._take_action(actions, dt) in collision_avoidance_env.py. Now an action list of [delta heading angle, speed] should get passed to agent.policy.external_action_to_action().

However, the VecEnv wrapper around the CollisionAvoidanceEnv (the env object returned by create_env()) calls the step() method from vec_env.py in the stable-baselines library.

This vec_env.py step() method calls step_async() and step_wait() from dummy_vec_env.py (baselines library).

In the step_wait() method, the action dict I pass in gets reduced to a plain action array (via self.actions[e]) before env.step() (the method defined in collision_avoidance_env.py) is called. See the code below.

def step_wait(self):
    print('step_wait() method')
    for e in range(self.num_envs):
        action = self.actions[e]
        # if isinstance(self.envs[e].action_space, spaces.Discrete):
        #    action = int(action)
        print('action from step_wait() method: {}'.format(action))
        obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(action)
        if self.buf_dones[e]:
            obs = self.envs[e].reset()
        self._save_obs(e, obs)
    return (self._obs_from_buf(), np.copy(self.buf_rews), np.copy(self.buf_dones),
            self.buf_infos.copy())

The one_env (unwrapped) environment doesn't have this issue, since the action dict is passed to its step() method directly. Should I nest a dict inside of the actions dict when using the VecEnv wrapper, e.g. action = {0: {0: [spd, heading]}}?
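
To make that indexing concrete, here is a minimal sketch (with a hypothetical action value) of what the pasted step_wait() does when I pass a dict to the wrapped env:

import numpy as np

# what I pass to the wrapped env's step(): {agent_id: 2-element action}
actions = {0: np.array([0.8, 0.3])}

# step_wait() indexes by environment number, e.g. e = 0:
action_for_env_0 = actions[0]   # -> array([0.8, 0.3])
# so the inner CollisionAvoidanceEnv.step() receives the bare array,
# not the {agent_id: action} dict that the external policy path expects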

Here is the random agent that works properly on the one_env environment.

import os
os.environ['GYM_CONFIG_CLASS'] = 'Train'

import gym
from gym_collision_avoidance.envs import Config
import gym_collision_avoidance.envs.test_cases as tc
from gym_collision_avoidance.experiments.src.env_utils import create_env
from stable_baselines.common.policies import MlpPolicy
#from stable_baselines.common import make_vec_env
from stable_baselines import A2C
from stable_baselines.common.env_checker import check_env

# env: a VecEnv wrapper around the CollisionAvoidanceEnv
# one_env: an actual CollisionAvoidanceEnv class (the unwrapped version of the first env in the VecEnv)

env, one_env = create_env()

# check_env(env, warn=True)
# model = A2C(MlpPolicy, env, verbose=1)
# model.learn(total_timesteps=1000)

# The reset method is called at the beginning of an episode
obs = one_env.reset()

num_episodes = 1000  # note: each loop iteration below is one env step, not a full episode

for i in range(num_episodes):
    # agent 0 is the external (learning) agent; give it a random action
    actions = {}
    actions[0] = one_env.action_space.sample()

    obs, reward, done, info = one_env.step(actions)
    if done:
        obs = one_env.reset()
mfe7 commented 3 years ago

Hmm, is it possible for you to send a list of dicts (e.g., actions = [{0: [spd, heading]}]) to env.step? The VecEnv wrapper you posted would then grab the first element of that list and send it to environment 0.
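
A minimal sketch of that suggestion applied to the random-agent loop (assuming a single wrapped environment and that agent 0 is the external learning agent; names and step count are illustrative):

obs = env.reset()

for i in range(1000):
    # one dict per wrapped environment (only environment 0 here);
    # the key 0 is the id of the external/learning agent
    actions = [{0: one_env.action_space.sample()}]
    obs, rewards, dones, infos = env.step(actions)
    # the step_wait() shown above already resets environment 0 when dones[0] is True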

mfe7 commented 3 years ago

I think that's essentially what we did here: https://github.com/mit-acl/gym-collision-avoidance/blob/dceff1094affdaabeaf624976ffa3918addbe17d/gym_collision_avoidance/experiments/src/env_utils.py#L45-L52

krishna-bala commented 3 years ago

Yes that works! Thank you.