manish-pra / copg

This repository contains all code and experiments for the competitive policy gradient (CoPG) algorithm.

Questions regarding implementation of Markov soccer #1

Open vrn25 opened 3 years ago

vrn25 commented 3 years ago

Hi! I had a few queries regarding the implementation of Markov Soccer using open_spiel: https://github.com/manish-pra/copg/blob/master/markov_soccer/soccer_state.py

import numpy as np

def get_two_state(time_step):
    state = time_step.observations["info_state"][0]
    # print(pretty_board(time_step))
    try:
        pos = np.nonzero(state[0:20])[0]
        y = int(pos/5)
        x = int(pos%5)
        a_pos = np.array([x,y])
        a_status = 0
    except:
        try:
            pos = np.nonzero(state[20:40])[0]
            y = int(pos/5)
            x = int(pos%5)
            a_pos = np.array([x,y])
            a_status =1
        except:
            a_pos = np.array([5,1.5])
            a_status = 1
    try:
        pos = np.nonzero(state[40:60])[0]
        # print(1,pos)
        y = int(pos/5)
        x = int(pos % 5)
        b_pos = np.array([x, y])
        b_status = 0
    except:
        try: # second try because, in some cases when the game is done, it is possible that neither player exists
            pos = np.nonzero(state[60:80])[0]
            # print(2,pos)
            y = int(pos / 5)
            x = int(pos % 5)
            b_pos = np.array([x, y])
            b_status = 1
        except:
            b_pos = np.array([-1,1.5])
            b_status = 1
    if a_status==1:
        o_pos = a_pos
    elif b_status==1:
        o_pos = b_pos
    else:
        pos = np.nonzero(state[80:100])[0]
        y = int(pos / 5)
        x = int(pos % 5)
        o_pos = np.array([x, y])
    Ba = o_pos - a_pos # ball relative to a position
    Bb = o_pos - b_pos # Ball relative to b position
    G1a = np.array([4,1]) - a_pos
    G2a = np.array([4,2]) - a_pos
    G1b = np.array([0,1]) - b_pos
    G2b = np.array([0,2]) - b_pos
    state1 = np.array([Ba, Bb, G1a, G2a, G1b, G2b]).reshape(12,)
    state2 = np.array([Bb, Ba, G1b, G2b, G1a, G2a]).reshape(12,)
    return a_status, b_status, state1, state2

Firstly, what is the interpretation of the state (the observation for agent A) list above? It is the information state of agent A, represented as a list of 120 items containing 0s and 1s. But how does it represent the positions of players A and B and the position of the ball in the 4x5 soccer grid? state[0:20], state[20:40], etc. seem to give a hint, but it is not very clear. (P.S. I looked for this in the open_spiel implementation but could not find it.)

Second, according to the paper, the state vector for player A is $s^A = [x_{GB}, y_{GB}, x_{ball}, y_{ball}, x_B, y_B]$, and the state vector of the game used during training (i.e. the input to the policy network of agent A) is $s = [s^A, s^B]$. However, the state1 array above does not seem to be implemented in this way. Instead of including the position of A relative to its opponent B, it has the position of the ball relative to B. Why is that?

Also, what is the meaning of the G1a, G2a, G1b, G2b arrays? They seem to be the relative positions of the player with respect to the opponent's goal. However, as per the videos (and the description in the paper), each goal consists of 2 cells and is located outside the 4x5 grid. In that case, why is it np.array([4,1]) and so on?

Thanks! :)

manish-pra commented 3 years ago
  1. Yes, in the implementation from open_spiel it is a list of 120 elements filled with 0's and 1's. Use the function pretty_board() to understand what it represents. Roughly, the cells of the 4x5 grid are numbered, say 0 to 19. The first 20 entries are a one-hot vector giving the location of agent A without the ball (a 1 at the cell where the agent is located, zeros everywhere else), the next 20 give agent A with the ball, then agent B without the ball, then agent B with the ball, then the ball itself, and so forth, as used in the function above (see the decoding sketch after this list). This way of representing the state carries redundant information and is also quite large. In order to shrink its size and to pass relative information about the other agent, we came up with the state shown in the function above (more description on pages 45-46 of the paper). We also used the goalposts to provide a sense of global/absolute location on the field.

  2. We trained the agents in the fully observable case; each agent has state information about its opponent (more description on page 46, line 4). In the case of soccer, probably just using $s = s^A$ would suffice instead of $s = [s^A, s^B]$, because $s^A$ already has complete information about A and B. However, we wanted it to be uniform with the racing game, so we used $s = [s^A, s^B]$.

  3. G1a, G2a, G1b, G2b provide a sense of absolute location to the agent as well as a direction to go. They can probably take any other values as long as they are symmetric; you could also use G1a = np.array([3,1]) - a_pos, G2a = np.array([3,2]) - a_pos, G1b = np.array([1,1]) - b_pos, G2b = np.array([1,2]) - b_pos. I don't remember exactly, but I think we numbered the x-axis from -1 to 5.
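As a rough illustration of how one 20-element chunk of that 120-element list can be decoded into board coordinates (a minimal sketch based on the code in this thread, not taken from the repo; the 4x5 board is assumed to be flattened row-major):

import numpy as np

def onehot_to_xy(chunk):
    """Decode one 20-element slice of the info_state into (x, y) grid coordinates.

    Returns None when the slice is all zeros, i.e. the piece it encodes is not on the board
    (for example, the "agent A without the ball" slice once A has picked up the ball).
    """
    idx = np.nonzero(chunk)[0]
    if idx.size == 0:
        return None
    x, y = int(idx[0] % 5), int(idx[0] // 5)  # column and row on the 4x5 grid
    return np.array([x, y])

# e.g. a_pos = onehot_to_xy(state[0:20]); ball_pos = onehot_to_xy(state[80:100])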

Thanks for asking on Github. Let me know if you have any further questions.

vrn25 commented 3 years ago

Thank you for your response @manish-pra. A few more queries -

  1. Where can I find the pretty_board() function? I see you have commented it out in the code, but I don't see it defined anywhere else. Is it from open_spiel?

  2. Can I use the following state (considering just relative positions, with the x-axis ranging from -1 to 5 and the y-axis from 0 to 3)?

Absolute positions of the goals: G1b = np.array([5, 1]), G2b = np.array([5, 2]), G1a = np.array([-1, 1]), G2a = np.array([-1, 2])

Absolute positions of A, B, ball: apos, bpos, opos respectively

Position of A relative to B's goal: pos_a_g1b = apos - G1b, pos_a_g2b = apos - G2b

Position of B relative to A's goal: pos_b_g1a = bpos - G1a, pos_b_g2a = bpos - G2a

Position of A relative to B: pos_a_b = apos - bpos

Position of B relative to A: pos_b_a = bpos - apos

Position of A relative to ball: pos_a_o = apos - opos

Position of B relative to ball: pos_b_o = bpos - opos

For the partially observable case: observation input for agent A: $s^A$ = [pos_a_g1b, pos_a_g2b, pos_a_b, pos_a_o]; observation input for agent B: $s^B$ = [pos_b_g1a, pos_b_g2a, pos_b_a, pos_b_o]

In the above case, the state is an 8-dim vector

For the fully observable case: observation input for agent A = $[s^A, s^B]$ and observation input for agent B = $[s^B, s^A]$, as defined above

In the above case, the state is a 16-dim vector

This looks to me like a more intuitive representation of the state.

  3. Would it be valid to use apos = np.array([5, 1.5]) when state[0:20] and state[20:40] are both completely zero? Are the y-coordinates allowed to be decimals? Also, when this happens the game would have ended (as A is no longer on the board), so why is a_status set to 1?

Thank you

vrn25 commented 3 years ago

Update

There is an implementation of pretty_board() for Tic-Tac-Toe: https://github.com/deepmind/open_spiel/blob/master/open_spiel/python/examples/tic_tac_toe_dqn_vs_tabular.py#L45. I think it will be similar for Markov soccer.

manish-pra commented 3 years ago
  1. Ah, I see. I think I removed it while cleaning up the code. Here's the function:
def pretty_board(time_step):
  """Returns the board in `time_step` in a human readable format."""
  info_state = time_step.observations["info_state"][0]
  print(info_state)
  a_locations = np.nonzero(info_state[0:20])[0]
  A_locations = np.nonzero(info_state[20:40])[0]
  b_locations = np.nonzero(info_state[40:60])[0]
  B_locations = np.nonzero(info_state[60:80])[0]
  o_locations = np.nonzero(info_state[80:100])[0]
  t_locations = np.nonzero(info_state[100:120])[0]
  board = np.full(4 * 5, ".")
  board[o_locations] = "0"
  board[a_locations] = "a"
  board[A_locations] = "A"
  board[B_locations] = "B"
  board[b_locations] = "b"
  board[t_locations] = "t"
  board = np.reshape(board, (4, 5))
  return board
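A possible way to use it while stepping the environment (hypothetical snippet, assuming open_spiel's rl_environment and the "markov_soccer" game, with pretty_board() defined as above):

from open_spiel.python import rl_environment

env = rl_environment.Environment("markov_soccer")
time_step = env.reset()
print(pretty_board(time_step))  # 4x5 board with 'a'/'A'/'b'/'B'/'0' markers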
  2. Yes, you can use that state. In the end, as a policy we are learning a map from states to actions; what we need is that the complete information is given to the agent. Whether the state is given as [a, b], as [a+b, a-b], or as [a+b, a], these are all one and the same thing (see the small sketch below).
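A tiny numerical illustration of that point (not from the repo): any invertible transformation of the state preserves the information available to the policy.

import numpy as np

a = np.array([2.0, 1.0])             # e.g. a quantity describing agent A
b = np.array([4.0, 3.0])             # e.g. a quantity describing agent B
s = np.concatenate([a + b, a - b])   # transformed state fed to the policy

# The original quantities are fully recoverable, so no information is lost:
a_rec = (s[:2] + s[2:]) / 2
b_rec = (s[:2] - s[2:]) / 2
assert np.allclose(a_rec, a) and np.allclose(b_rec, b)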
vrn25 commented 3 years ago

Thanks @manish-pra. I think using $[s^A, s^B]$ and $[s^B, s^A]$ (which are 16-dim vectors) as I defined above would mean having redundant information. We could instead use $s^A$ for agent 1 and $s^B$ for agent 2. However, I'm not sure whether it will make a difference in performance; if it does, performance might decrease, since we would have moved from a fully observed to a partially observed setting.

vrn25 commented 3 years ago

Also, while training Markov soccer, roughly how many episodes did it take for the winning probability of each agent to saturate? I have tried it, and the win rates for both agents became close to 0.5 in far fewer than the 30000 episodes stated in the paper.

manish-pra commented 3 years ago

Let me start by answering the previous questions.

  3. The same reasoning as in point 2 applies here. Moreover, np.array([5, 1.5]) is a terminal state, where the forward pass / agent policy is not evaluated. You can also use decimals for any state, and you can come up with any other transformation of the state vector. It will not alter the final result; the mapping will be learnt by the network as long as the state spans the entire information.

The change in the a_status/b_status information is used to capture the number of times the ball is transferred/snatched from one agent to the other since the beginning of the game. At the end of the game the status is not changed, and the agent who has the ball will finally go and score the goal, so a_status is kept the same (= 1).
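For illustration, a small hypothetical helper (not part of the repo) that counts those possession changes from the per-step flags returned by get_two_state:

def count_ball_transfers(status_history):
    """status_history: list of (a_status, b_status) tuples, one per time step."""
    transfers = 0
    for prev, cur in zip(status_history[:-1], status_history[1:]):
        if prev != cur:  # the possession flags flipped between consecutive steps
            transfers += 1
    return transfers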

(From your other comment) 4. Yes, with $[s^A, s^B]$ and $[s^B, s^A]$ the agents have redundant information. Here, using only $s^A$ already makes the environment state fully observable. But we still went with $[s^A, s^B]$ to keep the state uniform across all the games.

(From your other comment) 5. As long as the agents are equally strong, they will have a win probability of 0.5 against each other. We need to compare them against a fixed player to know when they are saturating (see the sketch below).
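For example, a rough sketch of that kind of evaluation (hypothetical; play_episode is an assumed helper that runs one Markov soccer game and returns +1 if the first policy scores, -1 if the fixed opponent scores, and 0 for a draw):

def win_rate_vs_fixed(policy, fixed_opponent, play_episode, episodes=500):
    """Estimate how often `policy` beats a fixed opponent; track this over training."""
    wins = sum(1 for _ in range(episodes) if play_episode(policy, fixed_opponent) > 0)
    return wins / episodes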

Related to the question of the state vector: as an interesting fact, if you look at the state vector of the car racing game, it is written in curvilinear coordinates. If we wrote it in global coordinates (x, y, yaw, v, etc.), the agent would have a hard time learning, because it would have to capture global information and learn the underlying track, as opposed to learning a controller in curvilinear coordinates ($\rho$, $d$, $\mu$, $v$, etc.). After training with this state, the agent can also generalize well to other, similar tracks.

There are many other games and scenarios that can be explored. Are you trying it for fun/learning?

vrn25 commented 3 years ago

Thanks, @manish-pra. The following is a slightly modified implementation of the state:

import numpy as np

def get_two_state(time_step):
    state = time_step.observations['info_state'][0]

    try:
        pos = np.nonzero(state[0:20])[0]
        x = int(pos % 5)
        y = int(pos / 5)
        apos = np.array([x, y])
        a_status = 0
    except:
        try:
            pos = np.nonzero(state[20:40])[0]
            x = int(pos % 5)
            y = int(pos / 5)
            apos = np.array([x, y])
            a_status = 1
        except:
            apos = np.array([5, 1.5])
            a_status = 1

    try:
        pos = np.nonzero(state[40:60])[0]
        x = int(pos % 5)
        y = int(pos / 5)
        bpos = np.array([x, y])
        b_status = 0
    except:
        try:
            pos = np.nonzero(state[60:80])[0]
            x = int(pos % 5)
            y = int(pos / 5)
            bpos = np.array([x, y])
            b_status = 1
        except:
            bpos = np.array([-1, 1.5])
            b_status = 1

    if a_status == 1:
        opos = apos
    elif b_status == 1:
        opos = bpos
    else:
        pos = np.nonzero(state[80:100])[0]
        x = int(pos % 5)
        y = int(pos / 5)
        opos = np.array([x, y])

    #print('apos', apos, 'bpos', bpos, 'opos', opos)

    # B's goal (absolute position)
    G1b = np.array([5, 1])
    G2b = np.array([5, 2])
    # A's goal (absolute position)
    G1a = np.array([-1, 1])
    G2a = np.array([-1, 2])

    # position of A relative to B's goal
    pos_a_g1b = apos - G1b
    pos_a_g2b = apos - G2b

    # position of B relative to A's goal
    pos_b_g1a = bpos - G1a
    pos_b_g2a = bpos - G2a

    # position of A relative to B
    pos_a_b = apos - bpos

    # position of B relative to A
    pos_b_a = bpos - apos

    # position of A relative to ball
    pos_a_o = apos - opos

    # position of B relative to ball
    pos_b_o = bpos - opos

    s_A = np.array([pos_a_g1b, pos_a_g2b, pos_a_b, pos_a_o]).reshape(8,)
    s_B = np.array([pos_b_g1a, pos_b_g2a, pos_b_a, pos_b_o]).reshape(8,)

    full_s_A = np.array([s_A, s_B]).reshape(16,)
    full_s_B = np.array([s_B, s_A]).reshape(16,)

    return a_status, b_status, full_s_A, full_s_B
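For reference, a minimal sketch of how I would exercise this function in a random rollout (assuming open_spiel's rl_environment and the "markov_soccer" game; the random actions are just placeholders, not the trained policies):

import numpy as np
from open_spiel.python import rl_environment

env = rl_environment.Environment("markov_soccer")
num_actions = env.action_spec()["num_actions"]

time_step = env.reset()
while not time_step.last():
    a_status, b_status, full_s_A, full_s_B = get_two_state(time_step)
    assert full_s_A.shape == (16,) and full_s_B.shape == (16,)
    actions = np.random.randint(0, num_actions, size=2).tolist()  # placeholder random policies
    time_step = env.step(actions)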
  1. Returning s_A, s_B would also make the game fully observable; full_s_A just has some redundant information. How can we make this setting partially observable (in the sense of centralized training and decentralized execution)? It would be interesting to see how CoPG works with that. Currently, we have centralized execution.

  2. I have modified G1a, G2a, G1b, G2b, but relatively speaking it should still be fine, I think. Please have a look at this function and let me know if there could be any problem with using it. I have one other thing to confirm: in the first try-except block, we go to the except section when A has the ball, right? Now, if time_step.last() == False and A has the ball, we go to the try block inside the except block. And finally, when the episode terminates with A winning, we enter the inner except block and set apos = np.array([5, 1.5]) (the terminal state), right?

manish-pra commented 3 years ago
  1. You can use full_s_A = [s_A] for decentralized execution. I would not call it partial observability, because for this game s_B does not add any extra information for agent A. The car racing game could be more interesting for such a setting, where each agent would know everything about its own state but only the relative position of the other car; the velocity/yaw rate/heading of the other agent would not be known to agent 1.

  2. Yes you are right.

vrn25 commented 3 years ago

Yes, it won't be partially observable that way in Markov soccer. Car racing (and maybe some other continuous-control environments too) would make sense. For the car racing simulator, I see you have provided the code. Is this an environment newly proposed in this paper, or is there another paper/documentation/codebase that could be referred to? I was facing a bit of difficulty setting it up. Thanks!

manish-pra commented 3 years ago

What is the difficulty you are facing in setting up the car racing env? I mean, is it the dynamics of the car or the way the trajectory is collected from the simulator for RL? The environment is not new; even the real track exists. You can find the simulator code here, but I don't think a Python version of the simulator is available online apart from the CoPG repo. Extensions such as the multi-agent version and setting it up as an RL environment were done here. Earlier it was used for testing racing controllers.

But feel free to ask questions about setting up the car racing environment in another issue.

vrn25 commented 3 years ago

Thanks, @manish-pra. I'll ask questions about Car racing in another issue.