thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org

Batching with a variable action space #369

Closed: dzy1997 closed this issue 3 years ago

dzy1997 commented 3 years ago

I am trying to train an agent for the board game Blokus, which has a potentially large (~32K) action space, while only a small fraction of those actions are valid per step under the rules. I am implementing the environment so that it provides all valid actions as the observation at each step (so the shape of the observation differs every step). Next I plan to train a DQN agent, but I am not sure how replay buffer batches work with a variable action space. I have read the tic-tac-toe example in the docs. What should I change from tic-tac-toe to adapt to a variable number of actions per step? Thanks!

Trinkle23897 commented 3 years ago

I'm not quite sure what a variable action space means. btw, such a large action space (32K) may not let DQN converge even with a mask added, so I would suggest reducing or changing the action space :)


The tic-tac-toe example stores the action mask in obs by adding a gym.Wrapper which changes the observation space to a dict of (obs, mask, agent_id). Of course, you can use another format for the mask, because in your setting a 32K-length binary array is a waste of memory, but remember to change the corresponding policy.forward code accordingly.
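
For reference, a minimal sketch of such a wrapper (get_valid_mask() and current_player are hypothetical helpers on your own env, not tianshou API):

import gym
import numpy as np

class MaskedObsWrapper(gym.Wrapper):
    # Pack obs, action mask, and agent id into a dict, like the tic-tac-toe example.
    # get_valid_mask() and current_player are hypothetical helpers of the wrapped env.
    def reset(self):
        return self._wrap(self.env.reset())

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        return self._wrap(obs), rew, done, info

    def _wrap(self, obs):
        return {
            'obs': obs,
            'mask': self.env.get_valid_mask(),    # boolean array over the full action space
            'agent_id': self.env.current_player,  # 1-indexed agent id
        }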

dzy1997 commented 3 years ago

A variable action space means that the number of valid actions varies each step. Since I encode each valid action into the observation, the observation (the input to the Q-network) also varies each step, and the output of the Q-network is a vector of length n, where n is the number of valid actions at that step (I am using an all-true mask). I am curious how tianshou would handle a batch taken from the replay buffer, given the different shapes of the observations and Q-values.

Trinkle23897 commented 3 years ago

the observation (input to the Q-network) also varies each step

So I'm wondering how you will implement this Q-network without using a transformer architecture. IMHO a PyTorch network accepts fixed-shape inputs and outputs in most cases.

dzy1997 commented 3 years ago

That would be my question as well. Currently I am making use of batching and ConvNets in PyTorch. For a step with N valid actions, the input shape is (N, 1, H, W) and the output shape is (N, 1, Hout, Wout), which is further mapped to an output score of shape (N,) with a linear layer. It seems OK to evaluate each action independently. But I doubt whether it still works with a batch of samples from the replay buffer. (I know it will probably be a hack; I am just curious whether it is possible at all.)
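
For concreteness, here is a minimal sketch of the per-action scoring network I have in mind (the layer sizes are just placeholders):

import torch
import torch.nn as nn

class PerActionQNet(nn.Module):
    # Scores each candidate board independently: (N, 1, H, W) -> (N,) Q-values,
    # where N is the number of valid actions at the current step.
    def __init__(self, h=14, w=14):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(16 * h * w, 1)

    def forward(self, boards):                          # boards: (N, 1, H, W)
        feat = self.conv(boards)                        # (N, 16, H, W)
        return self.head(feat.flatten(1)).squeeze(-1)   # (N,) one Q-value per action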

Trinkle23897 commented 3 years ago

Probably not. AFAIK only a Transformer-style network can handle variable-length inputs and outputs.

dzy1997 commented 3 years ago

OK, so I got an exception when trying to collect an episode from my environment via collector.collect(n_episode=1, render=.1). The collector seems to add consecutive steps into a buffer, and since my environment provides a variable number of actions per step, I got a shape mismatch error when merging the masks (the length of the mask equals the number of actions at that step). I have not even gotten to the Q-network and it already failed. Does that mean tianshou won't work at all with an environment that provides a variable-length mask each step, and that I have to at least pad the mask with some False entries to keep its shape consistent every step? Should I do the same thing to the 'obs' tensor?
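
If padding is the way to go, I imagine it would look something like this (MAX_ACTIONS is a hypothetical fixed upper bound; with ~32K actions this is exactly the memory waste mentioned above):

import numpy as np

MAX_ACTIONS = 512  # hypothetical fixed upper bound on valid actions per step

def pad_step(valid_boards, board_shape=(14, 14)):
    # Pad the per-step observation and mask to a fixed shape so consecutive
    # steps can be stacked in the replay buffer; only the first n entries are real.
    n = len(valid_boards)
    obs = np.full((MAX_ACTIONS, *board_shape), -1, dtype=np.int32)  # -1 = empty cell
    obs[:n] = np.asarray(valid_boards)
    mask = np.zeros(MAX_ACTIONS, dtype=bool)
    mask[:n] = True
    return obs, mask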

Trinkle23897 commented 3 years ago

I'm doing it in #362.

Trinkle23897 commented 3 years ago

Could you please have a try with #375?

dzy1997 commented 3 years ago

I tried and it did not work. (I was using the 0.4.1 release and manually pasted the content of the #375 commit into batch.py.) I still got the shape mismatch error (on the obs tensor).

Trinkle23897 commented 3 years ago

Could you please provide a minimal code snippet so that I can check? Also, have you reinstalled the edited version? Can you successfully run the code in https://github.com/thu-ml/tianshou/pull/375#issue-650773441?

dzy1997 commented 3 years ago

Let me paste my code here since it's not long. Maybe it will work on your dev branch.

import gym
import numpy as np
from copy import deepcopy
from typing import Optional, Tuple, Union

from tianshou.env import MultiAgentEnv

class BlokusEnv(MultiAgentEnv):
    # block-related
    blocks=[
        np.array([[0,0]]),
        np.array([[0,0],[1,0]]),
        np.array([[0,0],[1,0],[2,0]]),
        np.array([[0,0],[1,0],[0,1]]),
        np.array([[0,0],[1,0],[2,0],[3,0]]),
        np.array([[0,0],[1,0],[2,0],[0,1]]),
        np.array([[0,0],[-1,0],[1,0],[0,1]]),
        np.array([[0,0],[1,0],[1,1],[2,1]]),
        np.array([[0,0],[1,0],[0,1],[1,1]]),
        np.array([[0,0],[1,0],[2,0],[3,0],[4,0]]),
        np.array([[0,0],[1,0],[2,0],[3,0],[0,1]]),
        np.array([[0,0],[1,0],[2,0],[3,0],[1,1]]),
        np.array([[0,0],[1,0],[2,0],[1,1],[2,1]]),
        np.array([[0,0],[1,0],[2,0],[0,1],[2,1]]),
        np.array([[-1,-1],[-1,0],[0,0],[1,0],[1,1]]),
        np.array([[0,0],[-1,0],[1,0],[0,1],[1,-1]]),
        np.array([[0,0],[-1,0],[1,0],[1,-1],[1,1]]),
        np.array([[0,0],[1,0],[2,0],[2,1],[2,2]]),
        np.array([[0,0],[-1,-1],[0,-1],[1,0],[1,1]]),
        np.array([[0,0],[-1,0],[1,0],[0,-1],[0,1]]),
        np.array([[0,0],[1,0],[2,0],[0,1],[0,2]])
    ] # x+ to right, y+ to down
    transforms=[
        np.array([[1,0],[0,1]]), # orig
        np.array([[0,1],[-1,0]]), # counter90
        np.array([[-1,0],[0,-1]]), # counter180
        np.array([[0,-1],[1,0]]), # counter270
        np.array([[-1,0],[0,1]]), # xflip
        np.array([[0,1],[1,0]]), # +counter90
        np.array([[1,0],[0,-1]]), # +counter180
        np.array([[0,-1],[-1,0]]) # +counter270
    ] 
    def trans_block(block: np.ndarray, trans):
        return trans.dot(block.T).T
    def block_to_grid(block): 
        # list of coords to 2d array of bools, (y,x) indexing
        x1,x2,y1,y2 = block[:,0].min(),block[:,0].max(),block[:,1].min(),block[:,1].max()
        h,w = y2-y1+1, x2-x1+1
        grid = np.full((h,w),0)
        for (x,y) in block:
            grid[y-y1, x-x1] = 1
        return grid
    def block_to_grid_all(block: np.ndarray, all_trans):
        grids = []
        for trans in all_trans:
            t_block = BlokusEnv.trans_block(block, trans)
            grid = BlokusEnv.block_to_grid(block)
            eq = False
            for grid1 in grids:
                if np.array_equal(grid, grid1):
                    eq = True
                    break
            if not eq:
                grids.append(grid)
        return grids
    def grid_to_block(grid):
        block = []
        for y in range(len(grid)):
            for x in range(len(grid[0])):
                if grid[y,x] > 0:
                    block.append([x,y])
        return np.array(block)
    def is_valid_move(board, grid, x, y, player):
        # board is processed, 1 for me, -1 for others, 0 for empty
        block = BlokusEnv.grid_to_block(grid)
        block = block + np.array([x,y]) # translate block
        m = len(board) - 1
        # must be on board
        for (x,y) in block:
            if x<0 or x>m or y<0 or y>m:
                return False
        # no overlap
        for (x,y) in block:
            if board[y,x] != 0:
                return False
        # first piece
        if (board == 1).sum() == 0:
            (x0, y0) = (4, 9) if player == 0 else (9, 4) 
            for (x,y) in block:
                if x == x0 and y == y0:
                    return True
            return False
        # no sharing side
        for (x,y) in block:
            if x>0:
                if board[y,x-1] == 1: return False
            if x<m:
                if board[y,x+1] == 1: return False
            if y>0:
                if board[y-1,x] == 1: return False
            if y<m:
                if board[y+1,x] == 1: return False
        # at least share one corner
        for (x,y) in block:
            if x>0 and y>0:
                if board[y-1,x-1] == 1: return True
            if x>0 and y<m:
                if board[y+1,x-1] == 1: return True
            if x<m and y>0:
                if board[y-1,x+1] == 1: return True
            if x<m and y<m:
                if board[y+1,x+1] == 1: return True
        return False
    def convert_board(board, player):
        pos = board==player
        zero = board==-1
        board1 = np.full_like(board,-1)
        board1[pos] = 1
        board1[zero] = 0
        return board1

    def __init__(self):
        super().__init__()
        self.size = 14
        self.remain = np.full((2,21),True)
        self.passed = np.full((2),False)
        self.current_board = None
        self.valid_boards, self.valid_pieces = None, None
        self.current_agent = None
        self.last_move = None
        self.step_num = None

    def reset(self) -> dict:
        self.current_board = np.full((self.size, self.size), -1, dtype=np.int32)
        self.current_agent = 0
        self.valid_boards, self.valid_pieces = self.enumerate_valid_boards(self.current_agent)
        self.last_move = np.full((self.size, self.size), False)
        self.step_num = 0
        return {
            'agent_id': self.current_agent+1, # This stuff is 1-indexed
            'obs': np.array(self.valid_boards),
            'mask': np.full(len(self.valid_boards), True)
        }

    def simulate_move(self, move):
        board = self.current_board
        new_board = deepcopy(board)
        for (x,y) in move:
            new_board[y,x] = self.current_agent
        return new_board

    def enumerate_valid_boards(self,player): 
        # Each move is np.array, a list of board coordinates of the move
        valid_moves = []
        valid_pieces = []
        board = BlokusEnv.convert_board(self.current_board, self.current_agent)
        trans = BlokusEnv.transforms
        indices = np.arange(21)[self.remain[player]]
        for i in indices:
            block = BlokusEnv.blocks[i]
            grids = BlokusEnv.block_to_grid_all(block, trans)
            for grid in grids:
                h, w = grid.shape
                for y in range(self.size-h+1):
                    for x in range(self.size-w+1):
                        valid = BlokusEnv.is_valid_move(board, grid, x, y, player)
                        if valid:
                            move = BlokusEnv.grid_to_block(grid)+np.array([x,y])
                            valid_moves.append(move)
                            valid_pieces.append(i)

        valid_boards = [self.simulate_move(move) for move in valid_moves]
        valid_boards.append(self.current_board)
        valid_pieces.append(-1)
        return valid_boards, valid_pieces

    def step(self, action: Union[int, np.ndarray]
             ) -> Tuple[dict, np.ndarray, np.ndarray, dict]:
        if self.current_agent is None:
            raise ValueError(
                "calling step() of unreset environment is prohibited!")
        assert 0 <= action < len(self.valid_boards)
        current_agent = self.current_agent
        self._move(action)
        # the game is over when both players passed
        done = self.passed.sum() == 2
        if self.last_move is not None:
            reward = self.last_move.sum()
        else:
            reward = 0
        obs = {
            'agent_id': self.current_agent,
            'obs': np.array(self.valid_boards),
            'mask': np.full(len(self.valid_boards), True)
        }
        vec_rew = np.array([0, 0], dtype=np.float32)
        vec_rew[current_agent] = reward
        if done:
            self.current_agent = None
        return obs, vec_rew, np.array(done), {}

    def _move(self, action):
        board = self.valid_boards[action]
        piece = self.valid_pieces[action]
        self.last_move = board != self.current_board
        self.current_board = board
        self.step_num += 1
        if piece == -1:
            print(f'Step {self.step_num} Player {self.current_agent} passes')
            self.passed[self.current_agent] = True
        else:
            print(f'Player {self.current_agent} plays piece {piece}')
            self.remain[self.current_agent][piece] = False
        # turn transition
        if self.passed.sum() < 2:
            if not self.passed[1-self.current_agent]:
                self.current_agent = 1-self.current_agent
            self.valid_boards, self.valid_pieces = self.enumerate_valid_boards(self.current_agent)
            if len(self.valid_boards) == 1:
                self._move(0)

    def seed(self, seed: Optional[int] = None) -> int:
        pass

    def render(self, **kwargs) -> None:
        print(f'board (step {self.step_num}):')
        for y in range(self.size):
            for x in range(self.size):
                ch = '_'
                n = self.current_board[y,x]
                last = self.last_move[y,x]
                if n==0:
                    ch = 'A' if last else 'a'
                if n==1:
                    ch = 'B' if last else 'b'
                print(ch,end=' ')
            print('')
        print('--------')

    def close(self) -> None:
        pass

env = BlokusEnv()
obs = env.reset()

from tianshou.data import Collector
from tianshou.policy import RandomPolicy, MultiAgentPolicyManager

# agents should be wrapped into one policy,
# which is responsible for calling the acting agent correctly
# here we use two random agents
policy = MultiAgentPolicyManager([RandomPolicy(), RandomPolicy()])

# use collectors to collect an episode of trajectories
# the reward is a vector, so we need a scalar metric to monitor the training
collector = Collector(policy, env)

# you will see a long trajectory showing the board status at each timestep
result = collector.collect(n_episode=1, render=.1)

FYI, the first two steps always have 90 valid actions by rule, so the collection fails at step 3 on my machine.

Trinkle23897 commented 3 years ago

cool, I'll take a look tonight.

dzy1997 commented 3 years ago

The test code in the #375 comment does not work on my machine either. "IndexError: tuple index out of range" occurred in networkx/utils/decorators.py:457 when creating a dummy environment on the line train_envs = DummyVectorEnv([lambda i=x: GraphEnv(size=i) for x in [5, 10, 15]])

Trinkle23897 commented 3 years ago

That means you didn't successfully apply the change to your current environment. Please reinstall with pip install -e . in tianshou's main directory.

Trinkle23897 commented 3 years ago

The reason is

In [8]: a=np.zeros([2,90,14,14])

In [9]: b=np.array([None]*2)

In [10]: b[0]=a[0]

In [11]: b[[0]]=a[[0]]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-f02f040c9b13> in <module>
----> 1 b[[0]]=a[[0]]

ValueError: shape mismatch: value array of shape (1,90,14,14)  could not be broadcast to indexing result of shape (1,)
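
That is, fancy-index assignment into an object array tries to broadcast the value array, while element-wise assignment just stores a reference; the fallback (roughly what #375 does, shown here only as an illustration) is:

In [12]: for i in [0]:
    ...:     b[i] = a[i]   # element-wise assignment stores each slice as one object

In [13]: b[0].shape
Out[13]: (90, 14, 14)
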
Trinkle23897 commented 3 years ago

Could you please test again with the updated code in #375? Also there's a bug in your script:

        obs = {
            'agent_id': self.current_agent,  # this should be self.current_agent + 1
            'obs': np.array(self.valid_boards),
            'mask': np.full(len(self.valid_boards), True)
        }

dzy1997 commented 3 years ago

Thanks for finding the bug in my code! I will try your latest code in #375, but the commits seem to be in your forked repo, so I am not sure how to install it. Should I first uninstall the 0.4.1 release, then apply your commits on top of the master branch of the thu-ml/tianshou repo, and finally run pip install -e .?

Trinkle23897 commented 3 years ago

No need to uninstall, just reinstall with pip install -e . under my forked repo.

dzy1997 commented 3 years ago

I just reinstalled from your forked repo and am still getting the shape mismatch error below at step 3:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/tianshou/tianshou/data/buffer/manager.py in add(self, batch, buffer_ids)
    130         try:
--> 131             self._meta[ptrs] = batch
    132         except ValueError:

~/tianshou/tianshou/data/batch.py in __setitem__(self, index, value)
    257             try:
--> 258                 self.__dict__[key][index] = value[key]
    259             except KeyError:

~/tianshou/tianshou/data/batch.py in __setitem__(self, index, value)
    257             try:
--> 258                 self.__dict__[key][index] = value[key]
    259             except KeyError:

ValueError: shape mismatch: value array of shape (1,75) could not be broadcast to indexing result of shape (1,90)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-3-092d14b9d3fa> in <module>
     15 
     16 # you will see a long trajectory showing the board status at each timestep
---> 17 result = collector.collect(n_episode=1, render=.1)

~/tianshou/tianshou/data/collector.py in collect(self, n_step, n_episode, random, render, no_grad)
    242 
    243             # add data into the buffer
--> 244             ptr, ep_rew, ep_len, ep_idx = self.buffer.add(
    245                 self.data, buffer_ids=ready_env_ids)
    246 

~/tianshou/tianshou/data/buffer/manager.py in add(self, batch, buffer_ids)
    139                 _alloc_by_keys_diff(self._meta, batch, self.maxsize, False)
    140             self._set_batch_for_children()
--> 141             self._meta[ptrs] = batch
    142         return ptrs, np.array(ep_rews), np.array(ep_lens), np.array(ep_idxs)
    143 

~/tianshou/tianshou/data/batch.py in __setitem__(self, index, value)
    256         for key, val in self.items():
    257             try:
--> 258                 self.__dict__[key][index] = value[key]
    259             except KeyError:
    260                 if isinstance(val, Batch):

~/tianshou/tianshou/data/batch.py in __setitem__(self, index, value)
    256         for key, val in self.items():
    257             try:
--> 258                 self.__dict__[key][index] = value[key]
    259             except KeyError:
    260                 if isinstance(val, Batch):

ValueError: shape mismatch: value array of shape (1,75) could not be broadcast to indexing result of shape (1,90)

Here 75 is probably the number of choices at step 3, so the 'mask' shapes do not match. Should I fix something in my code regarding "The reason is..." that you posted above?

Trinkle23897 commented 3 years ago

Hmm, it seems like you didn't install it correctly, because L258 in batch.py currently is https://github.com/Trinkle23897/tianshou/blob/d2e8f4c153324082498630eea0ba2ef4694548e9/tianshou/data/batch.py#L258

257    def __setitem__(self, index: Union[str, IndexType], value: Any) -> None:
258        """Assign value to self[index]."""
259        value = _parse_value(value)

but yours are

~/tianshou/tianshou/data/batch.py in __setitem__(self, index, value)
    256         for key, val in self.items():
    257             try:
--> 258                 self.__dict__[key][index] = value[key]
    259             except KeyError:
    260                 if isinstance(val, Batch):

Have you checked out the batch-with-object branch?

Trinkle23897 commented 3 years ago

BTW, a simple (but dirty) workaround is:


class Data:
    pass

# ... and in your env observation (step and reset):
        temp_obs = Data()
        temp_mask = Data()
        temp_obs.obs = np.array(self.valid_boards)  # pack these variables into an object
        temp_mask.mask = np.full(len(self.valid_boards), True)
        return {
            'agent_id': self.current_agent+1, # This stuff is 1-indexed
            'obs': temp_obs,
            'mask': temp_mask,
        }

I guess it can work directly in v0.4.1 (no need to use #375) even though it is inefficient in memory layout.

dzy1997 commented 3 years ago

Oh, I did forget to switch to the batch-with-object branch yesterday. After switching branches I can now simulate almost a whole episode, except for a ValueError at the end, when both players have passed. The error message is

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/tianshou/tianshou/data/batch.py in __setitem__(self, index, value)
    270             try:
--> 271                 self.__dict__[key][index] = value[key]
    272             except KeyError:

ValueError: shape mismatch: value array of shape (1,44,14,14) could not be broadcast to indexing result of shape (1,11,14,14)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
~/tianshou/tianshou/data/batch.py in __setitem__(self, index, value)
    283                     for i in index:  # type: ignore
--> 284                         self.__dict__[key][i] = value[key][i]
    285                 except Exception:

ValueError: could not broadcast input array from shape (44,14,14) into shape (11,14,14)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-3-092d14b9d3fa> in <module>
     15 
     16 # you will see a long trajectory showing the board status at each timestep
---> 17 result = collector.collect(n_episode=1, render=.1)

~/tianshou/tianshou/data/collector.py in collect(self, n_step, n_episode, random, render, no_grad)
    260                 if self.preprocess_fn:
    261                     obs_reset = self.preprocess_fn(obs=obs_reset).get("obs", obs_reset)
--> 262                 self.data.obs_next[env_ind_local] = obs_reset
    263                 for i in env_ind_local:
    264                     self._reset_state(i)

~/tianshou/tianshou/data/batch.py in __setitem__(self, index, value)
    284                         self.__dict__[key][i] = value[key][i]
    285                 except Exception:
--> 286                     raise ValueError
    287 
    288     def __iadd__(self, other: Union["Batch", Number, np.number]) -> "Batch":

ValueError: 

Also, the dirty workaround you proposed above does not work directly, as the negation operator (~) used in the policy's forward is undefined for the Data class. Part of the error message is

~/tianshou/tianshou/policy/random.py in forward(self, batch, state, **kwargs)
     35         mask = batch.obs.mask
     36         logits = np.random.rand(*mask.shape)
---> 37         logits[~mask] = -np.inf
     38         return Batch(act=logits.argmax(axis=-1))
     39 

TypeError: bad operand type for unary ~: 'Data'

Trinkle23897 commented 3 years ago

After I switched branch I can now simulate almost a whole episode, except for a ValueError at the end, when both players have passed. The error message is

Yes, it is the same as mine. At that point I see the env obs has no agent_id in [1, 2]. Does it stop at the correct timestep?

Also, the dirty workaround you proposed above could not work directly, as the negation of Data class (used by policy forward) is undefined. Part of the error message is

You can change the random policy accordingly, for example:

masks = batch.obs.mask  # now masks is a numpy array of `Data`
acts = []
for i in range(len(masks)):  # since the lengths of each mask are not the same, we have to deal with them one by one
  mask = masks[i].mask  # now mask is a 1d numpy array extracted from masks
  logits = np.random.rand(*mask.shape)
  logits[~mask] = -np.inf
  acts.append(logits.argmax(axis=-1))
return Batch(act=acts)

dzy1997 commented 3 years ago

At that time I see the env obs has no agent_id in [1, 2].

I printed out the obs dict in my step() function when done is True. I did find the agent_id key, and it was either 1 or 2. What could change the obs afterwards? Also, I guess the auto-pass mechanism when there is only one choice may lead to bugs, but it was in the tic-tac-toe example as well, so I adopted it. What would happen if I did not use it? Would it break the Q-learning algorithm when there is only one valid action?

Trinkle23897 commented 3 years ago

How about now? Please pull the newest code in batch-with-object.

dzy1997 commented 3 years ago

I can now successfully collect a whole episode without errors. However, when I try to collect more than one episode, an exception occurs at the beginning of the second episode, as follows:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-2-3c48d1bcdc67> in <module>
     15 
     16 # you will see a long trajectory showing the board status at each timestep
---> 17 result = collector.collect(n_step=100, render=.1)

~/tianshou/tianshou/data/collector.py in collect(self, n_step, n_episode, random, render, no_grad)
    207                     with torch.no_grad():  # faster than retain_grad version
    208                         # self.data.obs will be used by agent to get result
--> 209                         result = self.policy(self.data, last_state)
    210                 else:
    211                     result = self.policy(self.data, last_state)

~/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/tianshou/tianshou/policy/multiagent/mapolicy.py in forward(self, batch, state, **kwargs)
    120                 # reward can be empty Batch (after initial reset) or nparray.
    121                 tmp_batch.rew = tmp_batch.rew[:, policy.agent_id - 1]
--> 122             out = policy(batch=tmp_batch, state=None if state is None
    123                          else state["agent_" + str(policy.agent_id)],
    124                          **kwargs)

~/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/tianshou/tianshou/policy/random.py in forward(self, batch, state, **kwargs)
     36         print('mask dtype is',mask.dtype)
     37         logits = np.random.rand(*mask.shape)
---> 38         logits[~mask] = -np.inf
     39         return Batch(act=logits.argmax(axis=-1))
     40 

IndexError: arrays used as indices must be of integer (or boolean) type

With some debugging I found that the dtype of the mask in random.py:forward() is bool at previous steps but object at the step where the exception occurs. Its content is correct though: it prints as [array([True True ... True])], while it should be [[True True ... True]] as at previous steps.

Trinkle23897 commented 3 years ago

Yes, but this is an inherent problem with masks of different lengths. You have to deal with them one by one, as in the code snippet in https://github.com/thu-ml/tianshou/issues/369#issuecomment-848493107, because a numpy array cannot accept an array of objects as an index. You can inherit from the existing DQNPolicy and override the corresponding part.
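
For example, a rough sketch of such an override, assuming the Data-object workaround and a per-action scoring model as discussed above (this is only a sketch; the exact DQNPolicy internals may differ between versions):

import numpy as np
import torch
from tianshou.data import Batch
from tianshou.policy import DQNPolicy

class VarActionDQNPolicy(DQNPolicy):
    # Handle one sample at a time because every sample's candidate-board array
    # and mask have a different length (they are stored as Data objects).
    def forward(self, batch, state=None, **kwargs):
        acts = []
        for i in range(len(batch.obs.obs)):
            boards = torch.as_tensor(
                batch.obs.obs[i].obs, dtype=torch.float32)    # (N_i, 14, 14)
            q = self.model(boards[:, None])                    # (N_i,) per-action Q-values
            mask = torch.as_tensor(batch.obs.mask[i].mask)     # (N_i,) boolean mask
            q = q.masked_fill(~mask, float('-inf'))            # rule out invalid actions
            acts.append(int(q.argmax()))
        return Batch(act=np.array(acts))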