joshmiller17 opened 5 years ago
Instead of MCTS returning one action at a time, it could predict the whole next sequence of actions up until a random element is introduced.
States/actions could mark themselves as random or deterministic.
It would be interesting to see the top ~3 unique sequence choices, ordered by predicted value.
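A minimal sketch of the idea, assuming a hypothetical search tree whose nodes carry an `is_chance` flag plus visit/value statistics (none of these names come from an existing API): after the MCTS search finishes, greedily walk the most-visited children and stop at the first chance node, so the planner returns a whole action sequence instead of a single move.

```python
# Hypothetical sketch only: Node fields (is_chance, visits, total_value,
# children) are assumptions, not part of any existing MCTS implementation.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    action: Optional[str] = None
    is_chance: bool = False          # True if reaching this state resolves randomness
    visits: int = 0
    total_value: float = 0.0
    children: List["Node"] = field(default_factory=list)

    @property
    def mean_value(self) -> float:
        return self.total_value / self.visits if self.visits else 0.0


def best_action_sequence(root: Node) -> List[str]:
    """Greedily follow the most-visited child until a chance node (or leaf)."""
    sequence: List[str] = []
    node = root
    while node.children and not node.is_chance:
        node = max(node.children, key=lambda c: c.visits)
        sequence.append(node.action)
    return sequence


def top_sequences(root: Node, k: int = 3) -> List[Tuple[List[str], float]]:
    """Return up to k sequences (distinct in their first action), ranked by mean value."""
    ranked = sorted(root.children, key=lambda c: c.mean_value, reverse=True)
    return [([c.action] + best_action_sequence(c), c.mean_value) for c in ranked[:k]]
```

`top_sequences` is one cheap way to surface the "top ~3 unique sequence choices": it only forces uniqueness on the first action, which keeps the sequences from being trivial prefixes of each other.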
Idea: create a reward function that approximates the value of a state from designer heuristics, used only to bias the MCTS rollout policy probabilities toward more efficient searches.
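One way this could look, as a sketch under assumptions: a designer-supplied `heuristic_value` function (a placeholder here, not an existing API) scores candidate successor states, and the rollout policy samples actions with softmax weights over those scores instead of uniformly.

```python
# Hypothetical sketch: heuristic_value() stands in for whatever state-value
# approximation the designer provides; it is not a real library function.
import math
import random


def heuristic_value(state) -> float:
    # Placeholder heuristic: here the "state" is just a number.
    return float(state)


def rollout_policy(actions, successor, temperature=1.0, rng=random):
    """Sample one action, weighted by a softmax over heuristic successor values.

    `successor(a)` returns the state reached by taking action `a`.
    A lower temperature concentrates rollouts on heuristically better moves;
    a higher one approaches the usual uniform-random rollout.
    """
    scores = [heuristic_value(successor(a)) / temperature for a in actions]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Keeping the heuristic confined to the rollout policy (rather than the backed-up rewards) means the search stays unbiased in the limit while still spending its simulations more efficiently.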