Open JingOY0610 opened 4 years ago
Can you share your trained h5 files? I don't have the resources to train a good agent.
@JingOY0610 thank you.
ad 1) I'm not sure I understand your point. we need to pick the best performing move among the legal ones. what do you have in mind there?
ad 2) yes, you're right. mind sending a quick PR? thanks!
For ad 1):
So we first predict the probabilities for each move here
move_probabilities = self.rollout_policy.predict(game_state)
move_probabilities is a list of probabilities in which the element at each index is the probability of the move with that index. But later on we do:
valid_moves = [m for idx, m in enumerate(move_probabilities)
if Move(encoder.decode_point_index(idx)) in game_state.legal_moves()]
max_index, max_value = max(enumerate(valid_moves), key=operator.itemgetter(1))
Here we drop some entries by keeping only the valid moves. A position in valid_moves then no longer identifies the move it belongs to, because earlier entries have been removed and everything after them has shifted.
Here is an example. Say
move_probabilities = [0, 0, 0, 0.01, 0.99, ..., 0]
The highest-probability index is 4. Now say index 0 is not a valid move. Then
valid_moves = [0, 0, 0.01, 0.99, ..., 0]
and we select max_index = 3, which is not right.
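The example above can be reproduced with plain lists. This is only a minimal sketch, not the code from the PR: `buggy_argmax`, `fixed_argmax` and `is_legal` are hypothetical names, and `is_legal` stands in for the `Move(encoder.decode_point_index(idx)) in game_state.legal_moves()` check. One possible fix is to keep each probability paired with its original index while filtering.

```python
import operator


def buggy_argmax(move_probabilities, is_legal):
    """Filter first, then enumerate: the original indices are lost."""
    valid_moves = [p for idx, p in enumerate(move_probabilities) if is_legal(idx)]
    if not valid_moves:
        return None
    # max_index is a position in the *filtered* list, not in move_probabilities.
    max_index, _ = max(enumerate(valid_moves), key=operator.itemgetter(1))
    return max_index


def fixed_argmax(move_probabilities, is_legal):
    """Keep (original_index, probability) pairs so the index survives filtering."""
    valid_moves = [(idx, p) for idx, p in enumerate(move_probabilities) if is_legal(idx)]
    if not valid_moves:
        return None
    max_index, _ = max(valid_moves, key=operator.itemgetter(1))
    return max_index


# Same numbers as in the example; index 0 is assumed to be the only illegal move.
move_probabilities = [0, 0, 0, 0.01, 0.99, 0]


def is_legal(idx):
    return idx != 0


buggy = buggy_argmax(move_probabilities, is_legal)  # 3: shifted by the removed entry
fixed = fixed_argmax(move_probabilities, is_legal)  # 4: the true best legal index
```

With the paired version, the returned index can be passed straight to encoder.decode_point_index without any offset correction.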
I also fixed this bug. Would you mind reviewing my PR? Thanks!
@JingOY0610 got it, you're right. thanks for spotting this - extremely helpful
However, while line 148 eliminates the invalid moves, it also discards the indices of those moves, so max_point is no longer the actual point with the maximum policy probability.
However, it does not assign the children to their parent.
My correction: