openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev
34.73k stars · 8.61k forks

Spaces class support 'sample without replacement' method? #497

Closed rockingdingo closed 7 years ago

rockingdingo commented 7 years ago

Hi OpenAI team,

I am quite new to the gym package and want to know whether it would be reasonable to add a method to the Space-related classes to support the 'sample without replacement' case.

Right now, space classes like Discrete and Box assume the action space is fixed and always valid. The sample() method randomly chooses one action from [0, n) (sampling with replacement), so the same action can be sampled many times during one episode. That works fine for games such as Atari, whose action spaces don't change.

'Sample with replacement'

from gym.spaces import prng

class Discrete(gym.Space):
    def __init__(self, n):
        self.n = n

    def sample(self):
        # Draws uniformly from [0, n); the same action can come up repeatedly
        return prng.np_random.randint(self.n)

However, there are many other games in which the valid action space keeps changing. In 'Go', for example, you can't play on a position that already holds a stone; such a move is illegal and ends the episode too early. The sample() method should then draw only from the remaining valid actions. Similarly, in poker games the valid action space is limited to your hand cards and keeps shrinking.

Do you think it would be worthwhile to add a method to the Space class that keeps track of the remaining actions, so that the env can sample only from the valid ones?

I am working on a Gomoku (five-in-a-row on a Go board) environment. For now I use a workaround: I add a remove() method to the Space, so that whenever an action is taken it is eliminated from valid_spaces, and sample() only draws from the remaining valid actions.

Something like this: 'sample without replacement'

class DiscreteWrapper(spaces.Discrete):
    def __init__(self, n):
        self.n = n
        self.valid_spaces = list(range(n))

    def sample(self):
        # Only sample from the remaining valid actions
        return prng.np_random.choice(self.valid_spaces)

    def remove(self, s):
        '''Remove action s from the valid actions.'''
        if s in self.valid_spaces:
            self.valid_spaces.remove(s)
        else:
            print("action %d is not in valid spaces" % s)

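For reference, here is a runnable, self-contained sketch of the same idea using only the standard library (the class name `DiscreteWithoutReplacement` is mine, not part of gym):

```python
import random

class DiscreteWithoutReplacement:
    """Toy discrete space that samples without replacement (illustrative only)."""
    def __init__(self, n):
        self.n = n
        self.valid_spaces = list(range(n))

    def sample(self):
        return random.choice(self.valid_spaces)

    def remove(self, s):
        if s in self.valid_spaces:
            self.valid_spaces.remove(s)

space = DiscreteWithoutReplacement(4)
seen = []
for _ in range(4):
    a = space.sample()
    seen.append(a)
    space.remove(a)   # taken actions become unavailable

# sorted(seen) == [0, 1, 2, 3]: each action was sampled exactly once
print(sorted(seen))
```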
tlbtlbtlb commented 7 years ago

I don't think this belongs in the action spaces, which are intended to remain simple. Logic like keeping track of available moves belongs in the agent. If you want the agent to take random samples while eliminating illegal moves, rejection sampling is efficient in all but the most pathological situations:

    while True:
        action = space.sample()
        if is_valid_action(action): break
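A self-contained sketch of that rejection-sampling loop (the 9-cell board, the `occupied` set, and `is_valid_action` are made up for illustration; gym's real Discrete space samples the same way):

```python
import random

class ToyDiscrete:
    """Minimal stand-in for gym.spaces.Discrete (samples with replacement)."""
    def __init__(self, n):
        self.n = n

    def sample(self):
        return random.randrange(self.n)

occupied = {0, 3, 7}      # positions already taken on a 9-cell board
space = ToyDiscrete(9)

def is_valid_action(action):
    return action not in occupied

# Rejection sampling: keep resampling until a legal action comes up
while True:
    action = space.sample()
    if is_valid_action(action):
        break

assert is_valid_action(action)
```

Unless nearly every action is illegal, the expected number of resamples stays small, which is why the extra bookkeeping inside the space is usually unnecessary.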
rockingdingo commented 7 years ago

Hi @tlbtlbtlb, thank you so much for your advice. Yes, rejection sampling works well for training the agent. A further question: what if there is an opponent inside the env, like the white player in 'Go', and that opponent's sample() keeps drawing illegal actions? Then the game always finishes earlier than normal and is thus incomplete.

'Go' environment: I tried to reproduce the top algorithm on the website for the 'Go' game, whose reward is always 1. It seems the black 'X' policy plays the same fixed positions in every episode. The trick is that once the white opponent randomly sample()s a move from the whole action space, the move is illegal if that position was already taken, and the env raises a 'lose' status for white. That is why the game always yields reward 1, while a real Go game never ends like this. The agent can't truly learn from such incomplete games, right?

Any ideas?

Top Evaluations on website: https://gym.openai.com/evaluations/eval_JIYm7FoWQlu1s1KIoijdAQ

rockingdingo commented 7 years ago

Btw, there are only 3 games in the 'board_game' category right now, and I just finished another board game, 'Gomoku' (five-in-a-row), played on the Go board. Would it be reasonable to contribute new environments to enrich this category?

I followed the guidance, and ‘gym_gomoku’ is already working. Repo: https://github.com/rockingdingo/gym-gomoku

Thank you.

tlbtlbtlb commented 7 years ago

env = gym.make('Go9x9-v0') returns a go env with illegal_move_mode='lose'. A good RL agent will learn not to make illegal moves in the same way it will learn not to make other bad moves: by associating the move with losing.

If you want to use a version of the env that reports illegal moves rather than losing, you can call

import pachi_py, gym.envs.board_game.go
...
env = gym.envs.board_game.go.GoEnv(player_color='black', opponent='pachi:uct:_2400', observation_type='image3c', illegal_move_mode='raise', board_size=9)

Calling .step() with an illegal move will then raise a pachi_py.IllegalMove exception, which you can catch with a try/except block:

while True:
    a = env.action_space.sample()
    try:
        observation, reward, done, info = env.step(a)
    except pachi_py.IllegalMove:
        continue
    break
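The same retry pattern can be exercised without pachi_py installed; here is a toy stand-in (the `IllegalMove` class, `ToyGoEnv`, and its occupied positions are invented for illustration, not part of gym):

```python
import random

class IllegalMove(Exception):
    """Stand-in for pachi_py.IllegalMove (illustrative only)."""

class ToyGoEnv:
    """Toy env that raises IllegalMove when stepping onto an occupied cell."""
    def __init__(self, n=9):
        self.n = n
        self.occupied = {0, 3, 7}

    def sample_action(self):
        return random.randrange(self.n)

    def step(self, a):
        if a in self.occupied:
            raise IllegalMove(a)
        self.occupied.add(a)
        return "obs", 0.0, False, {}

env = ToyGoEnv()
# Same loop shape as above: resample whenever step() raises IllegalMove
while True:
    a = env.sample_action()
    try:
        observation, reward, done, info = env.step(a)
    except IllegalMove:
        continue
    break

assert a not in {0, 3, 7}
```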