google-deepmind / open_spiel

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.

deep_cfr not working on Kuhn poker #1287

Open fumin opened 5 days ago

fumin commented 5 days ago

I compared the PyTorch deep_cfr with tabular CFR and found that deep_cfr has much higher exploitability (deep_cfr 0.83 vs. tabular 0.02). Moreover, printing out the deep_cfr policy suggests that it barely differs from uniform random.

Below is the output of deep_cfr:

Advantage for player: 0 [array(0.7407164, dtype=float32), array(1.2511972, dtype=float32), '...', array(269.03094, dtype=float32), array(256.87537, dtype=float32)]
Advantage Buffer Size for player 0 24207
Advantage for player: 1 [array(2.6763718, dtype=float32), array(3.7437634, dtype=float32), '...', array(356.60852, dtype=float32), array(298.85, dtype=float32)]
Advantage Buffer Size for player 1 16000
Strategy Buffer Size: 55828
Final policy loss: 42.306183
Deep CFR - NashConv: 0.8289245590747639
0        p=[0.483 0.517]
0pb      p=[0.479 0.521]
1        p=[0.470 0.530]
1pb      p=[0.463 0.537]
2        p=[0.478 0.522]
2pb      p=[0.439 0.561]
1p       p=[0.488 0.512]
1b       p=[0.480 0.520]
2p       p=[0.463 0.537]
2b       p=[0.444 0.556]
0p       p=[0.474 0.526]
0b       p=[0.468 0.532]

Compare the above with tabular CFR, which works nicely (a sketch of the comparison loop follows the output below):

Nash conv after step 0 is 0.6666666666666656
Nash conv after step 1 is 0.5833333333333321
Nash conv after step 2 is 0.4814814814814816
Nash conv after step 4 is 0.27956581948517456
Nash conv after step 8 is 0.2045374078549737
Nash conv after step 16 is 0.0957019801773486
Nash conv after step 32 is 0.09602625213077909
Nash conv after step 64 is 0.05657638349057331
Nash conv after step 128 is 0.035512027875593655
Nash conv after step 256 is 0.029967092000654116
Nash conv after step 400 is 0.024449861042890597
0        p=[0.788 0.212]
0pb      p=[1.000 0.000]
1        p=[0.993 0.007]
1pb      p=[0.418 0.582]
2        p=[0.435 0.565]
2pb      p=[0.000 1.000]
1p       p=[0.992 0.008]
1b       p=[0.654 0.346]
2p       p=[0.000 1.000]
2b       p=[0.000 1.000]
0p       p=[0.653 0.347]
0b       p=[1.000 0.000]
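For reference, the tabular CFR baseline is just the standard CFR loop, roughly along the lines of the sketch below (the exact script is omitted; it reuses the same print_policy helper as the deep_cfr code further down):

# Sketch of the tabular CFR baseline (standard CFRSolver loop).
from open_spiel.python.algorithms import cfr, exploitability
import pyspiel

game = pyspiel.load_game('kuhn_poker')
cfr_solver = cfr.CFRSolver(game)
for step in range(401):
  cfr_solver.evaluate_and_update_policy()
  if step in (0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 400):
    conv = exploitability.nash_conv(game, cfr_solver.average_policy())
    print(f'Nash conv after step {step} is {conv}')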

Below is the code used to test deep_cfr:

import itertools
from open_spiel.python import policy as policy_module
from open_spiel.python.algorithms import exploitability
from open_spiel.python.pytorch import deep_cfr
import pyspiel
import numpy as np

def print_policy(policy):
  # Print the action probabilities of every information state, one per line.
  for state, probs in zip(itertools.chain(*policy.states_per_player),
                          policy.action_probability_array):
    print(f'{state:6}   p={probs}')

def main():
  game = pyspiel.load_game('kuhn_poker')
  solver = deep_cfr.DeepCFRSolver(game,
        policy_network_layers=(32, 32),
        advantage_network_layers=(16, 16),
        num_iterations=400,
        num_traversals=40,
        learning_rate=1e-3,
        batch_size_advantage=None,
        batch_size_strategy=None,
        memory_capacity=1e7)
  _, advantage_losses, policy_loss = solver.solve()
  for player, losses in list(advantage_losses.items()):
    print("Advantage for player:", player,
                  losses[:2] + ["..."] + losses[-2:])
    print("Advantage Buffer Size for player", player,
                  len(solver.advantage_buffers[player]))
  print("Strategy Buffer Size:",
                len(solver.strategy_buffer))
  print("Final policy loss:", policy_loss)
  policy = policy_module.tabular_policy_from_callable(game, solver.action_probabilities)
  conv = exploitability.nash_conv(game, policy)
  print("Deep CFR - NashConv:", conv)

  np.set_printoptions(precision=3, suppress=True, floatmode='fixed')
  print_policy(policy)

if __name__ == "__main__":
  main()
lanctot commented 1 day ago

Hi @fumin, my guess is that this is related to bad hyper-parameters for this particular game.

Have you seen this paper: https://arxiv.org/abs/2103.00187? Have you tried the hyper-parameters it uses for Kuhn?

There will always be a large discrepancy between tabular methods and neural network methods. They are hard to compare, and you certainly should not compare them iteration-by-iteration; it's like comparing apples and oranges. One is exact (no sampling, no approximation); the other has errors from both sampling and function approximation. If you compared, e.g., value iteration against Q-learning with neural networks, you would see something similar, and it may even be worse in imperfect-information games, because the metric (exploitability) is harder to reduce: it evaluates the policy "adversarially" rather than measuring the reward of a (deterministic) policy in, e.g., a gridworld.

That said, 0.83 is quite high for Kuhn, so it's probably still an issue. Can you use the paper's hyper-parameters, compare to the graphs in the paper, and then follow up? Can you also try the JAX and/or TF implementations to see what they give?
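For example, something along these lines might be a better starting point (an untested sketch, not the exact settings from the paper; I'm assuming the PyTorch solver exposes the same policy_network_train_steps / advantage_network_train_steps arguments as the TF versions, so check the DeepCFRSolver constructor):

# Untested sketch: explicit batch sizes and more SGD steps per Deep CFR iteration.
# Values are illustrative; the *_train_steps argument names are assumed to match
# the TF solvers (check open_spiel/python/pytorch/deep_cfr.py).
from open_spiel.python.pytorch import deep_cfr
import pyspiel

game = pyspiel.load_game('kuhn_poker')
solver = deep_cfr.DeepCFRSolver(
    game,
    policy_network_layers=(64, 64),
    advantage_network_layers=(64, 64),
    num_iterations=400,
    num_traversals=40,
    learning_rate=1e-3,
    batch_size_advantage=2048,
    batch_size_strategy=2048,
    memory_capacity=int(1e7),
    policy_network_train_steps=400,
    advantage_network_train_steps=20)

The TF2 implementation (open_spiel/python/algorithms/deep_cfr_tf2.py) and the JAX one should run essentially the same experiment after swapping the import, modulo small constructor differences.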

fumin commented 1 day ago

Thanks for the Michael Walton et al. paper, which looks like exactly what I need for this issue! It's great that the paper provides code to replicate its results: https://github.com/aicenter/openspiel_reproductions.

I will report back soon with some observations and findings.