Closed Hororohoruru closed 1 year ago
Check out this documentation. You can implement your own rollout function.
I am not sure what you mean by "highest immediate value", but in the rollout
function, you have access to the state (and possibly history). You can compute some forward simulations from that state to get a value estimate according to some heuristic policy.
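Such a forward-simulation estimate could look roughly like this. This is only a sketch: the `step(state, action) -> (next_state, reward)` generative interface and the uniform-random heuristic policy are stand-ins for whatever simulator and heuristic you actually use.

```python
import random

def rollout_value(state, step, actions, depth=20, discount=0.95):
    """Estimate the value of `state` by simulating a heuristic
    (here: uniformly random) policy forward for `depth` steps
    and accumulating discounted rewards."""
    total, gamma = 0.0, 1.0
    for _ in range(depth):
        action = random.choice(actions)      # heuristic policy
        state, reward = step(state, action)  # forward simulation
        total += gamma * reward
        gamma *= discount
    return total
```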
I mean to take the action with the highest Q-value instead of a random action. Is that possible?
How would you have access to the Q-value during rollout?
After having a look at the code and other sources, I realize that does not make sense, since the rollout simulations are what give an estimate of the value in the first place...
But for example, couldn't the rollout function be called with the current belief of the Vnode? Then the rollout could perform a random selection weighted by the current belief.
Sure, you can sample from current belief and call the rollout function. I believe that's what PORollout does.
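Sampling a state from the belief could be sketched like this (here the belief is assumed to be a plain `{state: probability}` dict, not pomdp_py's actual belief classes):

```python
import random

def sample_from_belief(belief):
    """Draw a state from a belief given as {state: probability},
    weighting the random selection by the belief probabilities."""
    states = list(belief)
    weights = [belief[s] for s in states]
    return random.choices(states, weights=weights, k=1)[0]
```

The sampled state can then be handed to the rollout function as its starting point.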
I see, but that is to sample the state in the _search method. If I am not mistaken, POUCT does the same. However, I would like to do that during rollout. Should I make a subclass of POMCP or POUCT, as you proposed in #35?
Yes, I think you would have to figure out your own implementation for that.
Great, thanks! I'll close the issue then.
Hello! In tuning the planner for my particular problem, I am exploring potential solutions to some issues I'm having with the different steps of the POMCP simulation. One of them is to change the rollout policy used to add new nodes to the tree.
Right now, POMCP uses a random rollout that returns any action with uniform probability. I would like to implement a rollout that always selects the action with the highest immediate value, but it is not clear to me how to code it with the existing PolicyModel interface.
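For what it's worth, a greedy rollout of this kind could be sketched as below. The `RolloutPolicy` base here is a minimal stand-in for the class of the same name that the library provides (in practice you would subclass the real one and pass it to the planner), and `immediate_value(state, action)` is a hypothetical hook for your reward model or heuristic:

```python
class RolloutPolicy:
    """Minimal stand-in for pomdp_py's RolloutPolicy interface:
    a policy that maps a state (and optional history) to an action."""
    def rollout(self, state, history=None):
        raise NotImplementedError

class GreedyRollout(RolloutPolicy):
    """Rollout policy that always picks the action with the
    highest immediate value instead of a random action."""
    def __init__(self, actions, immediate_value):
        self._actions = actions
        # immediate_value(state, action) -> float is a hypothetical
        # hook for your reward model / heuristic.
        self._value = immediate_value

    def rollout(self, state, history=None):
        # Greedy selection over immediate values.
        return max(self._actions, key=lambda a: self._value(state, a))
```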