suragnair / alpha-zero-general

A clean implementation based on AlphaZero for any game in any framework + tutorial + Othello/Gobang/TicTacToe/Connect4 and more
MIT License

CanonicalForm changes outcome of gameplay #273

Open qu15d opened 1 year ago

qu15d commented 1 year ago

For my thesis, I'm trying to implement Nine Men's Morris as a new game. I have already implemented every method from the abstract Game class and got the training loop running. However, the agent doesn't learn anything: after two days of training, every model since the beginning has been rejected because it couldn't win against the previous iteration. I started investigating and found that once the game is played in canonical form (as during the MCTS episodes), the only game outcomes are "opponent wins" or "draw". No samples were gathered where the player was winning, so the net probably couldn't learn.
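
For context, the canonical-form convention in the bundled two-player games is essentially a sign flip of the board. A minimal sketch (assuming a numpy board with pieces encoded as +1/-1, as in the included games; not copied from any particular game file):

import numpy as np

def get_canonical_form_sketch(board, player):
    # The canonical form is the board as seen by `player`: multiplying by
    # player (+1 or -1) makes the side to move always appear as +1.
    return player * np.asarray(board)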

To better understand the error, I built a dummy loop with random actions. Here are some findings:

I checked the other included games: Othello doesn't have a draw ending, and with my dummy code the outcomes of canonical-form games and non-canonical-form games are exactly the same. However, the same issue occurs with TicTacToe and Connect4.

I'm starting to think that this behavior is inherent to the implementation, as I haven't been able to figure anything out for the last couple of days.

My questions:

Any help would be greatly appreciated.

suragnair commented 1 year ago

Hi, could you check #263 and related issues and let me know if there's a link?

qu15d commented 1 year ago

Yes, #263 seems to be connected to this issue. To better investigate that possible bug, I'm using the following code, which is based on the MCTS logic and the playGame method from the Arena class.

import numpy as np
from tictactoe.TicTacToeGame import TicTacToeGame  # import path assumed from the repo layout

def run_game_play_test(number_of_runs=100, canonical=False):
    game = TicTacToeGame()
    np.random.seed(344)
    end_value_list = []
    for i in range(number_of_runs):
        next_s = game.getInitBoard()
        next_player = 1
        while True:
            end_value = game.getGameEnded(next_s, next_player)
            if end_value != 0:
                end_value_list.append(next_player * end_value)
                break
            else:
                actions = game.getValidMoves(next_s, next_player)
                indices = [i for i, x in enumerate(actions) if x == 1]
                a = np.random.choice(indices)
                next_s, next_player = game.getNextState(next_s, next_player, a)
                if canonical:
                    next_s = game.getCanonicalForm(next_s, next_player)
                    next_player = 1
    return end_value_list

Notice that I fix the random seed for reproducibility. With this code I can quickly simulate either switching players (1, -1, 1, ...) or switching the board (canonical board, next_player is always 1). The code is compatible with every game that implements the abstract Game class. I count wins and draws with the following code:

def get_end_value_statistics(end_value_list):
    player_wins = 0
    opponent_wins = 0
    draws = 0
    for i in end_value_list:
        if i == 1:
            player_wins += 1
        elif i == -1:
            opponent_wins += 1
        else:
            draws += 1
    print("player wins: ", player_wins)
    print("opponent wins: ", opponent_wins)
    print("draws: ", draws)

My findings:

My own implementation shows the same pattern as TicTacToe: with switching players, player wins, opponent wins and draws are roughly evenly distributed, but as soon as I use canonical mode only, the player wins get added to the opponent wins. My current understanding is that the outcome when using a canonical board should be exactly the same. However, after days of trying to figure this out, I'm not so sure about that assumption anymore.
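
Here is a minimal trace of why the sign collapses in canonical mode (a sketch assuming the bundled TicTacToe conventions: a 3x3 numpy board with pieces encoded as +1/-1 and action index = row*3 + col):

import numpy as np
from tictactoe.TicTacToeGame import TicTacToeGame  # assumed repo layout

game = TicTacToeGame()
board = np.array([[ 1,  1,  0],
                  [-1, -1,  0],
                  [ 0,  0,  0]])
board, next_player = game.getNextState(board, 1, 2)    # player 1 completes the top row
canonical = game.getCanonicalForm(board, next_player)  # flip to next_player's (-1) view
print(game.getGameEnded(canonical, 1))                 # -1: the win shows up as an opponent win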

I guess this issue massively impacts the training process, since the model never gets to observe a player win, but I'm not really sure about that either.

suragnair commented 1 year ago

Wow, you're on to something. Can you check #156 and see what happens if you revert to the previous version? I'll look closer in the coming days (sorry for the slow response).

qu15d commented 1 year ago

I made some progress based on your comment. I updated my test code with the information from the previous version:

def run_game_play_test(number_of_runs=100, canonical=False, multiply_final_next_player=False, random_seed=None):
    game = TicTacToeGame()
    np.random.seed(random_seed)
    end_value_list = []

    for i in range(number_of_runs):
        next_s = game.getInitBoard()
        next_player = 1
        while True:
            end_value = game.getGameEnded(next_s, next_player)
            if end_value != 0:
                if multiply_final_next_player:
                    end_value_list.append(next_player * end_value)
                else:
                    end_value_list.append(end_value)
                break
            else:
                actions = game.getValidMoves(next_s, next_player)
                indices = [i for i, x in enumerate(actions) if x == 1]
                a = np.random.choice(indices)
                next_s, next_player = game.getNextState(next_s, next_player, a)
                if canonical:
                    next_s = game.getCanonicalForm(next_s, next_player)
                    next_player = 1
    return end_value_list

Now I can simulate what happens if next_player is multiplied into the value of the game. As for the code in Arena: return self.game.getGameEnded(board, 1) and curPlayer*self.game.getGameEnded(board, curPlayer) are equivalent in outcome value. It only matters that the getGameEnded method actually uses the player argument to determine the game outcome, e.g. in TicTacToe:

if b.is_win(player):
    return 1
if b.is_win(-player):
    return -1
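
A quick sanity check of that equivalence could look like this (a sketch assuming a player-relative getGameEnded as in the snippet above; signs are compared because a draw is encoded as a small near-zero value):

import numpy as np

def check_arena_equivalence(game, finished_board):
    # Both Arena conventions should agree on who won a finished board.
    for cur_player in (1, -1):
        lhs = cur_player * game.getGameEnded(finished_board, cur_player)
        rhs = game.getGameEnded(finished_board, 1)
        assert np.sign(round(lhs)) == np.sign(round(rhs))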

Neither of these versions really changes the problem with the canonical form. Experiment with TicTacToe:

Could this be problematic for the training process, given that the training examples only contain opponent wins and draws?

I solved this problem in my own game by explicitly keeping track of the current player within the state vector. With this hack it is possible to observe all three outcomes (player wins, opponent wins and draws). However, this seems to impact training negatively: I just started a new training session and the old model keeps winning.
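
Roughly, the idea is something like the following (a hypothetical sketch, not my actual Nine Men's Morris code):

import numpy as np

def canonical_form_with_player(board, player):
    # Hypothetical sketch: flip the piece values as usual, but append the
    # absolute side to move as an extra feature, so the network can still
    # distinguish a win for the player from a win for the opponent.
    return np.append(player * np.asarray(board).ravel(), player)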

So I'm not sure anymore whether this is the problem, especially since the other games are implemented like this and seem to train fine for other people (I haven't started a training session for TicTacToe). Maybe an opponent win is logically the same as a player win, since the MCTS plays against itself ...

qu15d commented 1 year ago

I found a relevant bug in "Coach.py". In line 66 the game ending is calculated as follows:

r = self.game.getGameEnded(board, self.curPlayer)

This is not consistent with how the value is calculated in Arena.py. It needs to be:

r = self.curPlayer * self.game.getGameEnded(board, self.curPlayer)

I fixed this issue in my local copy and started a new round of training for my own game implementation. Hoping for the best.

suragnair commented 1 year ago

Thanks for digging into this! I believe line 69 of Coach.py

https://github.com/suragnair/alpha-zero-general/blob/1db83a7f12c9550558975a94c65dfcbc22f3a630/Coach.py#L69

does the multiplication.
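
For context, the sign handling there is roughly of the following shape (a paraphrased sketch, not the exact line; see the link above for the real code):

def assign_value_targets(trainExamples, r, cur_player):
    # Paraphrase: r is the final result from cur_player's perspective; examples
    # that were recorded while the other player was to move get the opposite sign.
    return [(board, pi, r * ((-1) ** (player != cur_player)))
            for (board, player, pi) in trainExamples]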

qu15d commented 1 year ago

I think you are right. It's pretty hard not to get lost in the code. I'm still struggling to get my training working. Right now the new model always loses to the old model. Something must still be wrong.

qu15d commented 1 year ago

I'm still on it. I traced the problem back to the MCTS implementation. Both Coach and Arena use the canonical form only to call MCTS, and then perform the action like this:

board, curPlayer = self.game.getNextState(board, curPlayer, action)

Now, MCTS needs to simulate the rest of the game from the current game state, but it only uses the canonical form, like this:

next_s, next_player = self.game.getNextState(canonicalBoard, 1, a)
next_s = self.game.getCanonicalForm(next_s, next_player)
v = self.search(next_s)

Which leads to the following problem:

Is this intended behaviour? Is this the reason why "-v" gets returned? But why is self.Es[s] not the negative of the value returned by the getGameEnded method? I assume that MCTS search should ideally return all possible ending values.
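
To make my current reading of the sign convention concrete, here is a stripped-down sketch (random playout instead of the real UCB selection, so it is only an approximation of what MCTS.search does):

import numpy as np

def search_sketch(game, canonical_board):
    # The value is always from the perspective of the player to move on
    # canonical_board (who is +1 by construction); the caller was the other
    # player, so every value is negated on the way back up.
    ended = game.getGameEnded(canonical_board, 1)
    if ended != 0:
        return -ended
    valids = game.getValidMoves(canonical_board, 1)
    a = np.random.choice(np.flatnonzero(valids))
    next_s, next_player = game.getNextState(canonical_board, 1, a)
    next_s = game.getCanonicalForm(next_s, next_player)
    return -search_sketch(game, next_s)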