google-deepmind / open_spiel

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.
Apache License 2.0

Tabular CFR final nash equilibrium plays the sides on empty tic_tac_toe board #1288

Closed · fumin closed this issue 6 days ago

fumin commented 1 week ago

Why does Tabular CFR play the sides on an empty tic_tac_toe board? Shouldn't the correct play be at the center? By playing the sides, I mean the final CFR average policy for the empty board is [0.072 0.177 0.072 0.177 0.002 0.177 0.072 0.177 0.072]. The center action gets the lowest probability of 0.002, which also seems very weird.

The NashConv value of 0.005 after 1024 steps does seem to indicate successful training, though.

Nash conv after step 0 is 1.0296009897922054
Nash conv after step 1 is 1.0231448034382464
Nash conv after step 2 is 1.0558273704815282
Nash conv after step 4 is 0.7840930066452345
Nash conv after step 8 is 0.5131075756977104
Nash conv after step 16 is 0.27398820392206685
Nash conv after step 32 is 0.14882404817885986
Nash conv after step 64 is 0.07843115321372457
Nash conv after step 128 is 0.040219964482678616
Nash conv after step 256 is 0.02033826050459841
Nash conv after step 512 is 0.010215142856195864
Nash conv after step 1024 is 0.005114789250588642
         p=[0.072 0.177 0.072 0.177 0.002 0.177 0.072 0.177 0.072]
0, 1     p=[0.000 0.000 0.000 0.406 0.547 0.000 0.047 0.000 0.000]
0, 1, 2, 3   p=[0.000 0.000 0.000 0.000 0.846 0.077 0.000 0.000 0.077]
0, 1, 2, 3, 4, 5   p=[0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111]
0, 1, 2, 3, 4, 5, 7, 6   p=[0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111]
0, 1, 2, 3, 4, 5, 7, 8   p=[0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111]
0, 1, 2, 3, 4, 6   p=[0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000]
0, 1, 2, 3, 4, 6, 5, 7   p=[0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111]

Below is the code to replicate the above results (taken from https://github.com/google-deepmind/open_spiel/blob/master/open_spiel/colabs/CFR_and_REINFORCE.ipynb)

import itertools

import numpy as np
import pyspiel
from open_spiel.python.algorithms import exploitability
from open_spiel.python import policy as policy_lib

def new_reach(so_far, player, action_prob):
  """Returns new reach probabilities."""
  new = np.array(so_far)
  new[player] *= action_prob
  return new

def calc_cfr(game, policy, curr_policy, regrets, state, reach):
  """Updates regrets; returns utility for all players."""
  if state.is_terminal():
    return state.returns()
  elif state.is_chance_node():
    return sum(prob * calc_cfr(game, policy, curr_policy, regrets, state.child(action), new_reach(reach, -1, prob))
               for action, prob in state.chance_outcomes())
  else:
    # We are at a player decision point.
    player = state.current_player()
    index = policy.state_index(state)

    # Compute utilities after each action, updating regrets deeper in the tree.
    utility = np.zeros((game.num_distinct_actions(), game.num_players()))
    for action in state.legal_actions():
      prob = curr_policy[index][action]
      utility[action] = calc_cfr(game, policy, curr_policy, regrets, state.child(action), new_reach(reach, player, prob))

    # Compute regrets at this state.
    cfr_prob = np.prod(reach[:player]) * np.prod(reach[player+1:])
    value = np.einsum('ap,a->p', utility, curr_policy[index])
    for action in state.legal_actions():
      regrets[index][action] += cfr_prob * (utility[action][player] - value[player])

    # Return the value of this state for all players.
    return value

def print_policy(policy):
  for state, probs in zip(itertools.chain(*policy.states_per_player),
                          policy.action_probability_array):
    print(f'{state:6}   p={probs}')

def main():
  game = pyspiel.load_game("tic_tac_toe")
  policy = policy_lib.TabularPolicy(game)
  initial_state = game.new_initial_state()
  curr_policy = policy.action_probability_array.copy()
  regrets = np.zeros_like(policy.action_probability_array)
  eval_steps = []
  eval_nash_conv = []
  iterations = 1024 + 1
  for step in range(iterations):
    # Compute counterfactual regrets over the whole tree.
    calc_cfr(game, policy, curr_policy, regrets, initial_state, np.ones(1 + game.num_players()))

    # Find the new regret-matching policy.
    floored_regrets = np.maximum(regrets, 1e-16)
    sum_floored_regrets = np.sum(floored_regrets, axis=1, keepdims=True)
    curr_policy = floored_regrets / sum_floored_regrets

    # Update the average policy.
    lr = 1 / (1 + step)
    policy.action_probability_array *= (1 - lr)
    policy.action_probability_array += curr_policy * lr

    # Evaluate the average policy at powers of two and at the final step.
    if step & (step - 1) == 0 or step == iterations - 1:
      nc = exploitability.nash_conv(game, policy)
      eval_steps.append(step)
      eval_nash_conv.append(nc)
      print(f'Nash conv after step {step} is {nc}')

  np.set_printoptions(precision=3, suppress=True, floatmode='fixed')
  print_policy(policy)

if __name__ == "__main__":
  main()
lanctot commented 6 days ago

Hi @fumin, this is not necessarily incorrect. The minimax-optimal values for every action at the root of Tic-Tac-Toe are zero. Since CFR is computing an approximate Nash equilibrium, there is no guarantee that it'll prefer any specific action over any other, since game-theoretically they're all equivalent.
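
One quick way to check this claim (a sketch on my end, assuming the minimax module in open_spiel.python.algorithms exposes alpha_beta_search as used below; adjust to your build if not) is to compute the minimax value of every child of the empty board:

import pyspiel
from open_spiel.python.algorithms import minimax

game = pyspiel.load_game("tic_tac_toe")
root = game.new_initial_state()
for action in root.legal_actions():
  child = root.child(action)
  # Game value of the resulting position from player 0's perspective,
  # assuming both players play optimally from here on.
  value, _ = minimax.alpha_beta_search(game, state=child, maximizing_player_id=0)
  print(f"first move {root.action_to_string(0, action)}: minimax value {value}")

If the claim above is right, all nine printed values come out as 0, so an equilibrium is free to spread probability over the first moves however it likes.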

fumin commented 6 days ago

Thanks for the explanation, that makes a lot of sense. In other words, two-player zero-sum only guarantees the existence of equilibria, not their uniqueness. Any mixture among equilibria is another equilibrium. Thanks for the insight!

lanctot commented 6 days ago

It's not really about uniqueness per se, it's more about the definition of Nash equilibrium. Any move made by the first player has the same game-theoretic optimal value (0) because player 1 can still force a tie under all circumstances. So player 1 is indifferent. Any Nash equilibrium will assign values of 0 to all nine moves, and so any root policy is "optimal" assuming the play afterward remains optimal.
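
You can also see this in the numbers from the script above: because the average policy is an approximate equilibrium of a game whose value is 0, its self-play expected return from the root should be close to (0, 0). A minimal sketch (assuming the game and policy objects from the script are still in scope; tic_tac_toe has no chance nodes, so the recursion below ignores them, and it walks the whole game tree, so it takes a moment):

import numpy as np

def self_play_value(state, pol):
  """Expected returns for both players when both follow pol."""
  if state.is_terminal():
    return np.array(state.returns())
  # Weight each child's value by the probability the policy assigns to that action.
  return sum(prob * self_play_value(state.child(action), pol)
             for action, prob in pol.action_probabilities(state).items())

print(self_play_value(game.new_initial_state(), policy))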

fumin commented 5 days ago

Thanks for the correction. I do need to change my conclusion to:

Any mixture among equilibria, *of the same value*, is another equilibrium

In general, different equilibria may have different values, especially in coordination games. However, in the case of zero-sum games, is it true that all equilibria have the same value "0", and thus it's always OK to mix equilibrium strategies?

lanctot commented 5 days ago

Yes, that is correct! 👍
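
For intuition on why mixing is safe in the zero-sum case, here is a small numpy check on a toy zero-sum matrix game (not tic-tac-toe; the matrix is just a hypothetical example with non-unique equilibria): every maximin strategy guarantees the same game value, and any convex combination of two maximin strategies is again maximin.

import numpy as np

# Row player maximizes, column player minimizes. The game value is 0, and the
# row player has many optimal strategies: row 2, or any mix with equal weight on rows 0 and 1.
A = np.array([[ 1.0, -1.0, 0.0],
              [-1.0,  1.0, 0.0],
              [ 0.0,  0.0, 0.0]])

def security_value(x):
  """Worst-case payoff of row strategy x against any column reply."""
  return np.min(x @ A)

x1 = np.array([0.5, 0.5, 0.0])  # one optimal strategy: mix rows 0 and 1
x2 = np.array([0.0, 0.0, 1.0])  # another optimal strategy: play row 2
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
  mix = alpha * x1 + (1 - alpha) * x2
  print(alpha, security_value(mix))  # stays at the game value 0 for every mixture

This is the property discussed above: in two-player zero-sum games all equilibria share the game value (0 in tic-tac-toe's case), so any mixture of equilibrium strategies is still optimal; in general-sum games, e.g. coordination games, neither statement holds.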