Chapter 13: Possible errors in AlphaGoMCTS agent

JingOY0610 commented 4 years ago

In dlgo.agent.alphago.AlphaGoMCTS we have the policy rollout function in line 142.

def policy_rollout(self, game_state):
for step in range(self.rollout_limit):
    if game_state.is_over():
        break
    move_probabilities = self.rollout_policy.predict(game_state)
    encoder = self.rollout_policy.encoder
    valid_moves = [m for idx, m in enumerate(move_probabilities) 
                    if Move(encoder.decode_point_index(idx)) in game_state.legal_moves()]
    max_index, max_value = max(enumerate(valid_moves), key=operator.itemgetter(1))
    max_point = encoder.decode_point_index(max_index)
    greedy_move = Move(max_point)
    if greedy_move in game_state.legal_moves():
        game_state = game_state.apply_move(greedy_move)
next_player = game_state.next_player
winner = game_state.winner()
if winner is not None:
    return 1 if winner == next_player else -1
else:
    return 0

However while line 148 eliminates the invalid moves, it also removes the indices of those moves, which makes the max_point no longer being the real point with maximum policy probability.

In dlgo.agent.alphago.AlphaGoNode we have the expand_children function in line 41

def expand_children(self, moves, probabilities):
for move, prob in zip(moves, probabilities):
    if move not in self.children:
        self.children[move] = AlphaGoNode(probability=prob)

However it does not assign the children with its parent.

My correction:

def expand_children(self, moves, probabilities):
    for move, prob in zip(moves, probabilities):
        if move not in self.children:
            self.children[move] = AlphaGoNode(parent=self, probability=prob)

wang869 commented 4 years ago

can you share your trained h5 files?I have no enough condition to train a good agent.

maxpumperla commented 4 years ago

@JingOY0610 thank you.

ad 1) I'm not sure I understand your point. we need to pick the best performing move among the legal ones. what do you have in mind there?

ad 2) yes, you're right. mind sending a quick PR? thanks!

JingOY0610 commented 4 years ago

@JingOY0610 thank you.

ad 1) I'm not sure I understand your point. we need to pick the best performing move among the legal ones. what do you have in mind there?

ad 2) yes, you're right. mind sending a quick PR? thanks!

For ad 1): So we first predict the probabilities for each move here move_probabilities = self.rollout_policy.predict(game_state) move_probabilities is a list of probabilities with each element representing the prob. of that index. But later on we did

valid_moves = [m for idx, m in enumerate(move_probabilities) 
        if Move(encoder.decode_point_index(idx)) in game_state.legal_moves()]
max_index, max_value = max(enumerate(valid_moves), key=operator.itemgetter(1))

Here we removed some indices by keeping only valid moves indices. The move_probabilities then do not represent the prob. of that index any more because some indices are removed.

Here is an example: Say move_probabilities = [0, 0, 0, 0.01, 0.99, ...., 0] The highest probability index is 4 Say the index 0 is not a valid move. Then valid_moves = [0, 0, 0.01, 0.99, ..., 0] Then we select the max_index to be 3, which is not right.

I also fixed this bug. Would you mind taking a review on my PR? Thanks!

maxpumperla commented 4 years ago

@JingOY0610 got it, you're right. thanks for spotting this - extremely helpful

maxpumperla / deep_learning_and_the_game_of_go

Chapter 13: Possible errors in AlphaGoMCTS agent #55