maxpumperla / deep_learning_and_the_game_of_go

Code and other material for the book "Deep Learning and the Game of Go"
https://www.manning.com/books/deep-learning-and-the-game-of-go

Chapter 13: Possible errors in AlphaGoMCTS agent #55

Open JingOY0610 opened 4 years ago

JingOY0610 commented 4 years ago
  1. In dlgo.agent.alphago.AlphaGoMCTS we have the policy_rollout function at line 142:
    def policy_rollout(self, game_state):
        for step in range(self.rollout_limit):
            if game_state.is_over():
                break
            move_probabilities = self.rollout_policy.predict(game_state)
            encoder = self.rollout_policy.encoder
            valid_moves = [m for idx, m in enumerate(move_probabilities)
                           if Move(encoder.decode_point_index(idx)) in game_state.legal_moves()]
            max_index, max_value = max(enumerate(valid_moves), key=operator.itemgetter(1))
            max_point = encoder.decode_point_index(max_index)
            greedy_move = Move(max_point)
            if greedy_move in game_state.legal_moves():
                game_state = game_state.apply_move(greedy_move)

        next_player = game_state.next_player
        winner = game_state.winner()
        if winner is not None:
            return 1 if winner == next_player else -1
        else:
            return 0

However, while line 148 eliminates the invalid moves, it also drops their indices, so max_index (and therefore max_point) no longer refers to the point with the maximum policy probability.

  2. In dlgo.agent.alphago.AlphaGoNode we have the expand_children function at line 41:
    def expand_children(self, moves, probabilities):
        for move, prob in zip(moves, probabilities):
            if move not in self.children:
                self.children[move] = AlphaGoNode(probability=prob)

However, it does not assign the parent to the newly created children (see the backup sketch after the correction below).

My correction:

def expand_children(self, moves, probabilities):
    for move, prob in zip(moves, probabilities):
        if move not in self.children:
            self.children[move] = AlphaGoNode(parent=self, probability=prob)
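To see why the parent link matters: after a rollout, the value update walks from the leaf back to the root through parent pointers, so a child created without parent=self never passes its results up the tree. Below is a minimal sketch of that backup step, assuming an update_values method along these lines (simplified, not necessarily the book's exact implementation):

    class AlphaGoNode:
        def __init__(self, parent=None, probability=1.0):
            self.parent = parent  # link used by the backup step
            self.children = {}
            self.visit_count = 0
            self.q_value = 0.0
            self.prior_value = probability

        def update_values(self, leaf_value):
            # Recurse to the root first, then update this node's statistics.
            # If the node was expanded without parent=self, self.parent is
            # None and its results never reach its ancestors.
            if self.parent is not None:
                self.parent.update_values(leaf_value)
            self.visit_count += 1
            self.q_value += leaf_value / self.visit_count
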
wang869 commented 4 years ago

Can you share your trained h5 files? I don't have the resources to train a good agent.

maxpumperla commented 4 years ago

@JingOY0610 thank you.

ad 1) I'm not sure I understand your point. We need to pick the best-performing move among the legal ones. What do you have in mind there?

ad 2) Yes, you're right. Mind sending a quick PR? Thanks!

JingOY0610 commented 4 years ago

For ad 1): We first predict the probabilities for each move with move_probabilities = self.rollout_policy.predict(game_state). Here move_probabilities is a list of probabilities, with each element representing the probability of the move at that index. But later on we do:

    valid_moves = [m for idx, m in enumerate(move_probabilities)
                   if Move(encoder.decode_point_index(idx)) in game_state.legal_moves()]
    max_index, max_value = max(enumerate(valid_moves), key=operator.itemgetter(1))

Here we keep only the probabilities of the valid moves, which removes some entries. The positions in valid_moves then no longer correspond to the original point indices.

Here is an example. Say move_probabilities = [0, 0, 0, 0.01, 0.99, ..., 0]; the highest-probability index is 4. Now say index 0 is not a valid move. Then valid_moves = [0, 0, 0.01, 0.99, ..., 0], and we select max_index to be 3, which is not right.
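The shift is easy to reproduce in isolation (a toy sketch with made-up numbers, not output from the actual network):

    import operator

    # Toy reproduction of the index shift: pretend index 0 is illegal.
    move_probabilities = [0.0, 0.0, 0.0, 0.01, 0.99]
    legal_indices = {1, 2, 3, 4}
    valid_moves = [p for idx, p in enumerate(move_probabilities)
                   if idx in legal_indices]
    max_index, max_value = max(enumerate(valid_moves), key=operator.itemgetter(1))
    print(max_index)  # prints 3, but the best move actually sits at index 4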

I also fixed this bug. Would you mind reviewing my PR? Thanks!
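For reference, a minimal sketch of the kind of fix (using the names from the snippet above; the actual PR may differ): keep each probability's original index while filtering, so decode_point_index receives the right argument.

    # Filter by legality while preserving each move's original index.
    legal_moves = game_state.legal_moves()
    valid_moves = [(idx, p) for idx, p in enumerate(move_probabilities)
                   if Move(encoder.decode_point_index(idx)) in legal_moves]
    if valid_moves:
        max_index, max_value = max(valid_moves, key=operator.itemgetter(1))
        greedy_move = Move(encoder.decode_point_index(max_index))
        game_state = game_state.apply_move(greedy_move)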

maxpumperla commented 4 years ago

@JingOY0610 got it, you're right. Thanks for spotting this - extremely helpful!