Closed — Hororohoruru closed this issue 1 year ago
It seems like you have multiple terminal states, which is quite complex. Do you know whether your models are implemented correctly? Bugs happen a lot when implementing a POMDP, and it can be very challenging to figure out what's wrong based on the output behavior alone. (It is even harder for me to do this for you.)
My reactions to a few things you said:
"I found that the POMCP planning tree always takes a decision after the first observation, with really low belief on the corresponding state." (I'm not sure what this means)
"which in turn would collect more information" (that's not the criteria by which decisions are made)
I suggest writing unit tests for the components of your POMDP. Tests where you simulate your POMDP based on manually chosen actions are valuable too.
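For instance, a sketch of the second kind of test could look like this (the model objects, start state and action sequence come from your own problem definition; I'm assuming the usual pomdp_py-style `sample()` signatures):

```python
def run_manual_episode(transition_model, observation_model, reward_model,
                       start_state, action_sequence):
    """Simulate the POMDP with hand-picked actions and print each step,
    so you can eyeball whether transitions, observations and rewards
    behave as intended (sketch; assumes pomdp_py-style sample() methods)."""
    state = start_state
    for action in action_sequence:
        next_state = transition_model.sample(state, action)
        observation = observation_model.sample(next_state, action)
        reward = reward_model.sample(state, action, next_state)
        print(f"{state} --{action}--> {next_state}  obs={observation}  r={reward}")
        state = next_state
    return state
```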
Sorry, I noticed a bug where I created two different terminal states in `TransitionModel`. There is only one terminal state. As far as model implementation goes, I tested the `probability()` functions for the transition, observation and reward models with all combinations of `s`, `s'`, `a` and `o`, and all the outputs were as expected.
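A cheap way to complement those spot checks is to also verify that the distributions normalize. A sketch of such a check (here `states`, `actions` and `observations` enumerate the spaces, and the models follow the pomdp_py `probability()` signatures):

```python
def check_probability_tables(states, actions, observations,
                             transition_model, observation_model):
    """Exhaustively check that the transition and observation distributions
    are valid probability distributions for every (s, a) and (s', a) pair."""
    for s in states:
        for a in actions:
            total = sum(transition_model.probability(sp, s, a) for sp in states)
            assert abs(total - 1.0) < 1e-6, f"T(.|{s},{a}) sums to {total}"
    for sp in states:
        for a in actions:
            total = sum(observation_model.probability(o, sp, a) for o in observations)
            assert abs(total - 1.0) < 1e-6, f"O(.|{sp},{a}) sums to {total}"
```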
"I found that the POMCP planning tree always takes a decision after the first observation, with really low belief on the corresponding state." (I'm not sure what this means)
What I mean is that the belief for the tree starts uniformly distributed among the 12 possible states at the first time step. At this point, the highest-valued action is `wait` (equivalent to Tiger's 'listen'). However, if we look down the tree via indexing, we see that after executing action `wait` and receiving any observation (`o_3` in the first post's example), the best action for the following step is to execute the action corresponding to the received observation. At that point in time, the belief for the state corresponding to that action is around 0.6.
"which in turn would collect more information" (that's not the criteria by which decisions are made)
What I meant is that, given that the belief after one observation is around 0.6, and given the reward for correct decisions (10) and the cost for incorrect ones (100), the agent would not take a decision until the belief for a given state is between .85 and .98 in most cases. After one observation, the usual course of action is to select `wait` multiple times in order to receive more observations.
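(As a quick sanity check on those numbers: with +10 for a correct decision and -100 for an incorrect one, an immediate guess only pays off in expectation once the belief exceeds roughly 0.91.)

```python
# Back-of-the-envelope: the immediate expected reward of guessing state s with
# belief b is b*10 - (1-b)*100, which is non-negative only when b >= 100/110,
# i.e. an immediate guess only pays off in expectation above roughly 0.91.
b_break_even = 100 / 110
print(round(b_break_even, 3))   # 0.909
```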
While I understand that testing this is not possible from your side, I would like to turn to the behavior of the tree after executing action `wait` and receiving observation `o_3` at the beginning (the belief for `s_3` is now around .60):
```
(Pdb) dd[12][5]
o_3⟶_VNodePP(n=788, v=-56.256)
- [0] a_0: QNode(n=1, v=-100.000)
- [1] a_1: QNode(n=1, v=-100.000)
- [2] a_10: QNode(n=1, v=-100.000)
- [3] a_11: QNode(n=1, v=-100.000)
- [4] a_2: QNode(n=1, v=-100.000)
- [5] a_3: QNode(n=772, v=-56.256)
- [6] a_4: QNode(n=1, v=-100.000)
- [7] a_5: QNode(n=1, v=-100.000)
- [8] a_6: QNode(n=1, v=-100.000)
- [9] a_7: QNode(n=1, v=-100.000)
- [10] a_8: QNode(n=1, v=-100.000)
- [11] a_9: QNode(n=1, v=-100.000)
- [12] a_wait: QNode(n=5, v=-62.203)
```
While I'm not surprised by the value of taking `a_3` here, I'm surprised that the value for `wait` is lower, and I believe it has to do with the number of simulations, since `a_3` transitions to the terminal state, as shown by indexing the tree again:
```
(Pdb) dd[12][5][5]
a_3⟶_QNodePP(n=772, v=-56.256)
- [0] o_term: VNodeParticles(n=771, v=0.000)
```
In the meantime, only 5 simulations went to `wait`, and the potential observations under it have no simulations in them:
```
(Pdb) dd[12][5][12]
a_wait⟶_QNodePP(n=5, v=-62.203)
- [0] o_3: VNodeParticles(n=0, v=0.000)
- [1] o_4: VNodeParticles(n=0, v=0.000)
- [2] o_5: VNodeParticles(n=0, v=0.000)
- [3] o_6: VNodeParticles(n=0, v=0.000)
- [4] o_8: VNodeParticles(n=0, v=0.000)
```
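(For context, `dd` here is pomdp_py's `TreeDebugger` over the agent's tree; roughly, the inspection above is obtained like this, with indicative planner settings rather than my exact configuration:)

```python
import pomdp_py
from pomdp_py.utils import TreeDebugger

# `agent` is the problem's agent, built elsewhere from my own models.
planner = pomdp_py.POMCP(max_depth=7, num_sims=2000,
                         discount_factor=0.95, exploration_const=50,
                         rollout_policy=agent.policy_model)
planner.plan(agent)                 # build the search tree for the current belief

dd = TreeDebugger(agent.tree)
print(dd[12])           # the QNode for a_wait at the root (indices follow the printed listing)
print(dd[12][5])        # the belief node reached after observing o_3
print(dd[12][5][12])    # the QNode for a_wait under that belief node
```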
I will continue testing, but I would like to know whether you suspect what could be causing the simulations to be so skewed towards `a_3` in this case, with a belief as low as 0.6. As far as the implementation of the terminal state goes, I just want to know whether the elements I introduced (correctly, I believe; I will test further) are right:

- `s_term`, which is transitioned to whenever an action other than `wait` is taken, or whenever any action is taken at the last time step of a trial (denoted by `TDTransitionModel.max_t`)
- `o_term`
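To make that concrete, the logic I have in mind is roughly the following (a simplified sketch with placeholder classes, not my actual `TDTransitionModel`; it assumes the underlying state does not change under `wait`):

```python
import pomdp_py

class TDState(pomdp_py.State):
    """Placeholder state: an id ('0'..'11' or 'term') plus a time step."""
    def __init__(self, sid, t):
        self.id, self.t = sid, t
    def __eq__(self, other):
        return isinstance(other, TDState) and (self.id, self.t) == (other.id, other.t)
    def __hash__(self):
        return hash((self.id, self.t))
    def __repr__(self):
        return f"s_{self.id}(t={self.t})"

S_TERM = TDState("term", -1)   # the single terminal state

class SketchTransitionModel(pomdp_py.TransitionModel):
    """Simplified version of the terminal-state logic described above."""
    def __init__(self, max_t):
        self.max_t = max_t     # last time step of a trial

    def sample(self, state, action):
        if state == S_TERM:
            return S_TERM                        # terminal state is absorbing
        if action.name != "a_wait" or state.t >= self.max_t:
            return S_TERM                        # deciding, or running out of time, ends the trial
        return TDState(state.id, state.t + 1)    # 'wait': same underlying state, next time step

    def probability(self, next_state, state, action):
        # Transitions are deterministic in this sketch.
        return 1.0 if next_state == self.sample(state, action) else 0.0
```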
I think what you're observing may be a limitation of MCTS. Since the number of simulations is limited, it cannot provide an accurate estimate of all belief states, and it tends to exploit rather than explore enough (but this could be tuned, and indeed you seem to have tried).
What's the planning depth? You're getting 0 visits because either (1) the tree has reached its depth limit, or (2) the tree has not expanded there yet due to an insufficient number of simulations.
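(For intuition, the in-tree action selection uses a UCB1-style score, roughly like the sketch below; the exact form of the bonus in the implementation may differ slightly. A larger exploration constant increases the bonus for rarely visited actions.)

```python
import math

def ucb1(q_value, n_action, n_parent, exploration_const):
    # UCB1-style score used to pick actions inside the search tree: the current
    # value estimate plus an exploration bonus that shrinks as the action is
    # visited more often.
    return q_value + exploration_const * math.sqrt(
        math.log(n_parent + 1) / (n_action + 1))
```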
I understand. However, does it make sense to dedicate 99% of the simulations to a path that has a deterministic outcome? Maybe my understanding of MCTS is limited, but I don't understand why more simulations would go to a path with a deterministic transition. I understand that it exploits the action with the best value, but then I don't understand how `wait` is initialized to a lower value than taking an immediate decision with a belief this low...
The planning depth starts at 7, so I believe it should be enough to expand up to there...
Also, I reverted back to before implementing the terminal state and this was not an issue, so I imagine the problem comes from there.
Update! After playing around with longer planning times / higher numbers of simulations, I believe the problem does indeed come from there. When initially left to plan for 10 seconds, the estimated value at the subsequent steps w.r.t. the belief is much closer to what is expected.
Since such planning times are not allowed by the problem I'm working on, is it possible to run a long planning session at the beginning of a POMDP run, save that initial tree, and initialize each trial with it? Trials are independent, but their initial belief is reset to be identical.
I’m glad!
Yes, it should be possible. After planning initially, the agent will have a `tree` attribute which represents the policy. The tricky part is if, during run time, you get observations that are not on any branch.
Cool! Can I just assign the tree to a variable and then create a new instance of the problem and assign the initial tree to the new agent?
> The tricky part is if, during run time, you get observations that are not on any branch.
Indeed, that is something that can happen. However, my idea was to keep the online planning I currently do during the trial. It is not much, but I do have time to plan in between steps. Even if the number of simulations will be dramatically lower, it should help fine-tune the tree after it is pruned. Hopefully that will avoid getting unexpected observations.
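Something along these lines is what I have in mind (a sketch; `make_agent()` is a placeholder for rebuilding the agent with the shared initial belief, and I'm assuming that assigning to `agent.tree` is enough to reuse the precomputed policy):

```python
import copy
import pomdp_py

# One long planning run at the start (environment interaction omitted).
agent0 = make_agent()
planner = pomdp_py.POMCP(max_depth=7, planning_time=10.0,
                         discount_factor=0.95, exploration_const=50,
                         rollout_policy=agent0.policy_model)
planner.plan(agent0)
initial_tree = agent0.tree              # save the tree built by the long initial plan

for trial in range(20):                 # arbitrary number of trials
    agent = make_agent()                # every trial starts from the same initial belief
    # Reuse the precomputed tree; deep-copying (if the tree objects allow it)
    # keeps later trials from inheriting updates made during earlier ones.
    agent.tree = copy.deepcopy(initial_tree)
    # ...then run the trial as usual: planner.plan(agent) between steps, and
    # planner.update(agent, real_action, real_observation) after each step so
    # the tree is pruned to the branch actually taken.
```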
Closing this now. Feel free to reopen if it comes up again.
Hello. When thinking about changing the rollout function as mentioned in #34, I realized that the model should have a terminal state so that the POMCP planner does not add any value after taking a decision within a trial. However, after implementing such a state using the reward and transition models, I found that the POMCP planning tree always takes a decision after the first observation, with really low belief on the corresponding state. Since this was not happening before, I was wondering whether I implemented the goal state incorrectly.
First, this is how the tree for a given trial looks:
Here we suppose we get `o_3`, the observation corresponding to the correct state:
What surprises me is the high number of simulations going to `a_3` here, given that the only thing that happens is going to the terminal state:
While there are only 5 simulations for the `wait` action, which in turn would collect more information and has different possible observations:
I don't think this behavior is normal... it did not change when changing the exploration constant between the default value and 50. The terminal state is implemented as a state with `id` 'term' (i.e. `s_term`). Here are my transition and reward models:

Here is also my `observation_model`, in case it helps:
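The part relevant to the terminal state follows this kind of pattern (a simplified sketch with placeholder classes, not the actual models; the observation noise level `p` and the structure of the non-terminal observations are arbitrary here):

```python
import random
import pomdp_py

class TDObservation(pomdp_py.Observation):
    """Placeholder observation: just an id ('0'..'11' or 'term')."""
    def __init__(self, oid):
        self.id = oid
    def __eq__(self, other):
        return isinstance(other, TDObservation) and self.id == other.id
    def __hash__(self):
        return hash(self.id)
    def __repr__(self):
        return f"o_{self.id}"

class SketchRewardModel(pomdp_py.RewardModel):
    """Simplified sketch: +10 / -100 for decisions, 0 for waiting,
    and 0 forever once the trial has reached s_term."""
    def sample(self, state, action, next_state):
        if state.id == "term":
            return 0                                   # no reward accumulates after a decision
        if action.name == "a_wait":
            return 0
        return 10 if action.name == f"a_{state.id}" else -100

class SketchObservationModel(pomdp_py.ObservationModel):
    """Simplified sketch: s_term always emits o_term; otherwise the correct
    observation with probability p, and one of the others uniformly."""
    def __init__(self, n_states=12, p=0.7):
        self.ids = [str(i) for i in range(n_states)]
        self.p = p

    def probability(self, observation, next_state, action):
        if next_state.id == "term":
            return 1.0 if observation.id == "term" else 0.0
        if observation.id == "term":
            return 0.0
        if observation.id == next_state.id:
            return self.p
        return (1.0 - self.p) / (len(self.ids) - 1)

    def sample(self, next_state, action):
        if next_state.id == "term":
            return TDObservation("term")
        if random.random() < self.p:
            return TDObservation(next_state.id)
        return TDObservation(random.choice([i for i in self.ids if i != next_state.id]))
```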