huangeddie / GymGo

An environment of the board game Go using OpenAI's Gym API

Make it a valid gym env #1

Closed Hizoul closed 4 years ago

Hizoul commented 4 years ago

Dear Eddie,

first of all, thank you for contributing to the open-source reinforcement learning community by offering your implementation of Go! I like the possibility of having a nice UI to look at what is happening below deck. However, your gym environment doesn't fulfill the basic requirements of an OpenAI Gym environment.

The description of Gym states that it is "A toolkit for developing and comparing reinforcement learning algorithms." However, if I try to use an RL algorithm (e.g. stable_baselines), as is done in learn.py, we run into a few issues that my pull request fixes:

  1. No observation_space defined.
  2. action_space needs to be a gym.spaces object, not just a regular Python number.
  3. Moved action_space into the constructor, because it must not change during the game: it is used to initialize the size of the neural network's output layer (a sketch follows this list).
  4. (not fixed by the pull request) The algorithm will pretty quickly suggest an invalid move, which throws an exception and aborts the learning process.
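
For illustration, here is a minimal sketch of what points 1-3 could look like in the constructor (the board-size handling and the 4-channel observation shape are assumptions on my part, not your actual code):

    import gym
    import numpy as np

    class GoEnv(gym.Env):
        def __init__(self, board_size=7):
            self.board_size = board_size
            # One discrete action per intersection plus one pass action
            self.action_space = gym.spaces.Discrete(board_size * board_size + 1)
            # Observation: binary board planes; the channel count here is an assumption
            self.observation_space = gym.spaces.Box(
                low=0, high=1, shape=(4, board_size, board_size), dtype=np.float32)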

The fourth point is actually why I wanted to look at how you solve this. For my own (currently not yet open-sourced) problem, I just let the agent output an array of numbers between 0 and 1 and then take the highest value that represents a valid play (people reading this: this might not be the best solution!). If you come up with a more solid, less guesswork solution to 4, I'd love to hear about it! I should also note that just giving a negative reward and not progressing the environment on an invalid action did not work for me: the agent would keep suggesting bad actions and, because of too many negative rewards, run into the NaN bug.

Best regards, and thanks again for your valuable contribution to the open-source community, Matthias

Error for #1:

Traceback (most recent call last):
  File "learn.py", line 7, in <module>
    env = DummyVecEnv([lambda: env])
  File "stable_baselines/common/vec_env/dummy_vec_env.py", line 23, in __init__
    self.keys, shapes, dtypes = obs_space_info(obs_space)
  File "stable_baselines/common/vec_env/util.py", line 71, in obs_space_info
    shapes[key] = box.shape
AttributeError: 'NoneType' object has no attribute 'shape'

Error for #2:

File "learn.py", line 9, in <module>
    rl_algo = PPO2("MlpPolicy", env=env)
  File "stable_baselines/ppo2/ppo2.py", line 101, in __init__
    self.setup_model()
  File "stable_baselines/ppo2/ppo2.py", line 134, in setup_model
    n_batch_step, reuse=False, **self.policy_kwargs)
  File "stable_baselines/common/policies.py", line 661, in __init__
    feature_extraction="mlp", **_kwargs)
  File "stable_baselines/common/policies.py", line 541, in __init__
    scale=(feature_extraction == "cnn"))
  File "stable_baselines/common/policies.py", line 221, in __init__
    self._pdtype = make_proba_dist_type(ac_space)
  File "stable_baselines/common/distributions.py", line 491, in make_proba_dist_type
    " Must be of type Gym Spaces: Box, Discrete, MultiDiscrete or MultiBinary.")
NotImplementedError: Error: probability distribution, not implemented for action space of type <class 'int'>. Must be of type Gym Spaces: Box, Discrete, MultiDiscrete or MultiBinary.

Error for #4:

Traceback (most recent call last):
  File "learn.py", line 10, in <module>
    rl_algo.learn(10000)
  File "stable_baselines/ppo2/ppo2.py", line 335, in learn
    obs, returns, masks, actions, values, neglogpacs, states, ep_infos, true_reward = runner.run()
  File "stable_baselines/ppo2/ppo2.py", line 480, in run
    self.obs[:], rewards, self.dones, infos = self.env.step(clipped_actions)
  File "stable_baselines/common/vec_env/base_vec_env.py", line 134, in step
    return self.step_wait()
  File "stable_baselines/common/vec_env/dummy_vec_env.py", line 40, in step_wait
    self.envs[env_idx].step(self.actions[env_idx])
  File "gym_go/envs/go_env.py", line 89, in step
    self.state = GoGame.get_next_state(self.state, action)
  File "gym_go/gogame.py", line 77, in get_next_state
    raise Exception("Invalid Move", action, state)
Exception: ('Invalid Move', (1, 3), [....])
huangeddie commented 4 years ago

Hi Hizoul,

Thanks for your PR. I like the changes for the action_space and observation_space! However, I want this repo to be as lightweight as possible. Therefore, could you remove the learn.py script? I don't want it to depend on this stable_baselines package.

Regarding the invalid moves, that information is embedded in the fourth channel of the state. It's also mentioned in the documentation. There's a function called uniform_random_action that demonstrates how to use it.
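
A minimal sketch of how that channel can be used to pick only valid actions (the exact array layout and the handling of the pass move are assumptions here; uniform_random_action is the real reference):

    import numpy as np

    def sample_valid_action(state):
        # Assumes `state` is a (channels, size, size) array whose fourth channel
        # (index 3) marks invalid intersections with 1s, as described above
        invalid_moves = state[3].flatten()
        # Treat the pass move (last action index) as always valid -- an assumption
        invalid_moves = np.append(invalid_moves, 0)
        valid_move_idcs = np.argwhere(invalid_moves == 0).flatten()
        return int(np.random.choice(valid_move_idcs))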

Hizoul commented 4 years ago

Hi Eddie,

I understand wanting to keep the environment as lightweight as possible, and I removed the file depending on stable_baselines from the branch. I only included it so you had some code to reproduce the fourth problem.

You are right that the uniform_random_action function only suggests valid actions. However, said function will never be called by an RL algorithm, nor is it used to clean the input in the step function. Including the info in the 4th channel is definitely helpful for training. Still, the step function will inevitably be called with invalid actions suggested by an RL algorithm during training, resulting in the 4th error mentioned above, which prevents the training from continuing. Hence, to be usable for learning, the environment will have to clean the action given to it in the step function, similar to what your uniform_random_action function does.

Best Regards, Matthias

Hizoul commented 4 years ago

For full clarity regarding the fourth issue: you could change the code as follows to allow an RL algorithm to train properly without the game throwing an error that shuts down the whole Python program. I tried it, and with these changes the removed learn.py runs through without getting aborted by an invalid-move error.

However, I am not certain that this is a good solution, especially since the neural network won't know which action actually got chosen and which of the values it outputs are relevant. That's why I include it here as a comment instead of pushing it to the branch. Change the action_space to accept an array of numbers:

    self.action_space = gym.spaces.Box(0, 1, shape=(GoGame.get_action_size(self.state),))

Change the step function to clean the suggested actions:

    def step(self, actions):
        '''
        Assumes the correct player is making a move. Black goes first.
        return observation, reward, done, info
        '''
        # `actions` is the Box vector output by the policy; pick the valid move
        # with the highest score (numpy is assumed to be imported as np)
        valid_moves = self.get_valid_moves()
        valid_move_idcs = np.argwhere(valid_moves > 0).flatten()
        current_max = -1.0
        action = 0
        for valid_move in valid_move_idcs:
            if actions[valid_move] > current_max:
                current_max = actions[valid_move]
                action = valid_move
        # ... continue with the original step logic, now using the cleaned `action`

Edit: used the wrong quotation marks for the suggested code

huangeddie commented 4 years ago

Hi Hizoul,

It seems like you want to use only the Gym API to select valid moves, which I understand if you're using the stable_baselines package. However, I don't think the solution you proposed is elegant. It seems that the Gym API does not have a way of expressing an action space that is an interval of integers with some of the integers excluded. I'm sorry for the inconvenience, but if you provide a more elegant solution, I'll certainly take a look!
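
For reference, a common workaround sits outside the action space itself: keep a Discrete space and mask the policy's per-action scores against the valid-move information before choosing. A sketch of that idea (illustrative only, not part of the current API):

    import numpy as np

    def masked_argmax(scores, valid_moves):
        # scores: one value per action from the policy
        # valid_moves: 1 for legal actions, 0 for illegal ones
        masked = np.where(valid_moves > 0, scores, -np.inf)
        return int(np.argmax(masked))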

I'll merge your current PR.

Best, Eddie