Closed Hororohoruru closed 1 year ago
Check out this documentation. You can implement your own rollout function.
I am not sure what you mean by "highest immediate value", but in the rollout
function, you have access to the state (and possibly history). You can compute some forward simulations from that state to get a value estimate according to some heuristic policy.
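Such a forward-simulation estimate could look roughly like this. This is only a sketch: the `step(state, action) -> (next_state, reward)` generative interface and the uniform-random heuristic policy are stand-ins for whatever simulator and heuristic you actually use.

```python
import random

def rollout_value(state, step, actions, depth=20, discount=0.95):
    """Estimate the value of `state` by simulating a heuristic
    (here: uniformly random) policy forward for `depth` steps
    and accumulating discounted rewards."""
    total, gamma = 0.0, 1.0
    for _ in range(depth):
        action = random.choice(actions)      # heuristic policy
        state, reward = step(state, action)  # forward simulation
        total += gamma * reward
        gamma *= discount
    return total
```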
I mean to take the action with the highest Q-value instead of a random action. Is that possible?
How would you have access to the Q-value during rollout?
After having a look at the code and other sources, I realize that does not make sense, since the rollout simulations are what give an estimate of the value in the first place...
But for example, couldn't the rollout function be called with the current belief of the Vnode? Then the rollout could perform a random selection weighted by the current belief.
Sure, you can sample from current belief and call the rollout function. I believe that's what PORollout does.
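Sampling a state from the belief could be sketched like this (here the belief is assumed to be a plain `{state: probability}` dict, not pomdp_py's actual belief classes):

```python
import random

def sample_from_belief(belief):
    """Draw a state from a belief given as {state: probability},
    weighting the random selection by the belief probabilities."""
    states = list(belief)
    weights = [belief[s] for s in states]
    return random.choices(states, weights=weights, k=1)[0]
```

The sampled state can then be handed to the rollout function as its starting point.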
I see, but that is to sample the state in the _search method. If I am not mistaken, POUCT does the same. However, I would like to do that during rollout. Should I make a subclass of POMCP or POUCT, as you proposed in #35?
Yes, I think you would have to figure out your own implementation for that.
Great, thanks! I'll close the issue then.
Hello! In tuning the planner for my particular problem, I am exploring potential solutions to some issues I'm having with the different steps of the POMCP simulation. One of them is to change the rollout policy used to add new nodes to the tree.
Right now, POMCP uses a random rollout that returns any action with uniform probability. I would like to implement a rollout that always selects the action with the highest immediate value, but it is not clear to me how to code it with the existing PolicyModel interface.
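For what it's worth, a greedy rollout of this kind could be sketched as below. The `RolloutPolicy` base here is a minimal stand-in for the class of the same name that the library provides (in practice you would subclass the real one and pass it to the planner), and `immediate_value(state, action)` is a hypothetical hook for your reward model or heuristic:

```python
class RolloutPolicy:
    """Minimal stand-in for pomdp_py's RolloutPolicy interface:
    a policy that maps a state (and optional history) to an action."""
    def rollout(self, state, history=None):
        raise NotImplementedError

class GreedyRollout(RolloutPolicy):
    """Rollout policy that always picks the action with the
    highest immediate value instead of a random action."""
    def __init__(self, actions, immediate_value):
        self._actions = actions
        # immediate_value(state, action) -> float is a hypothetical
        # hook for your reward model / heuristic.
        self._value = immediate_value

    def rollout(self, state, history=None):
        # Greedy selection over immediate values.
        return max(self._actions, key=lambda a: self._value(state, a))
```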