Closed · RajeshDM closed this issue 4 months ago
Good question. Actions in the list are equally good (i.e. same value) as the action outside. If the list is empty, then the action outside is the best. Code ref.
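Concretely, the bracketed list works like this (a hypothetical helper for illustration, not the actual debugger code): mbp prints the argmax action at each node, followed by the other child actions whose values are exactly equal to it.

```python
def best_and_ties(children_values):
    """children_values: dict mapping action name -> value at a tree node.
    Returns the argmax action and the list of equally-valued siblings
    (the bracketed list printed next to each action in the output)."""
    best = max(children_values, key=children_values.get)
    ties = [a for a in children_values
            if a != best and children_values[a] == children_values[best]]
    return best, ties
```

An empty list therefore means the chosen action's value is strictly better than every sibling's.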
When it says "just as good", it means the expected value at that node in the tree: all the actions at that point give the same value. But it looks like this is not taking the expected reward into account; it only seems to take into account the initial value assignments (for example, I have a few actions in the action prior that all receive the same initial cost, but I'm hoping the reward differentiates between them). [The move_rotmove_rot action is just a rotation action.]
move_rotmove_rot(330,45)(0->0) []
move_rotmove_rot(360,315)(0->0) []
move_rotmove_rot(30,225)(0->0) []
done [move_rotmove_rot(360,135)(0->0), move_rotmove_rot(30,225)(0->0), move_rotmove_rot(30,270)(0->0), move_rotmove_rot(360,225)(0->0), move_rotmove_rot(30,315)(0->0), move_rotmove_rot(30,135)(0->0), move_rotmove_rot(360,270)(0->0), move_rotmove_rot(30,90)(0->0), move_rotmove_rot(360,90)(0->0), move_rotmove_rot(360,315)(0->0), move_rotmove_rot(360,180)(0->0), move_rotmove_rot(30,180)(0->0), move(0->1)]
_VNodePP(n=299, v=-56.132)(depth=0)
├─── ₁₄move_rotmove_rot(330,45)(0->0)⟶_QNodePP(n=31, v=-56.132)
│ └─── ₀CosObservation(r:(((18, 27, 45), RobotStatus(done=False), {}));o:)⟶_VNodePP(n=30, v=-13.011)(depth=1)
│ ├─── ₁₂move_rotmove_rot(360,315)(0->0)⟶_QNodePP(n=4, v=-13.011)
│ │ └─── ₀CosObservation(r:(((18, 27, 45), RobotStatus(done=False), {}));o:)⟶_VNodePP(n=3, v=72.850)(depth=2)
│ │ ├─── ₄move_rotmove_rot(30,225)(0->0)⟶_QNodePP(n=1, v=72.850)
│ │ │ └─── ₀CosObservation(r:(((18, 27, 45), RobotStatus(done=False), {}));o:)⟶_VNodePP(n=0, v=0.000)(depth=3)
│ │ │ ├─── ₀done⟶_QNodePP(n=0, v=0.000)
So if you see, it basically says all actions are as good as done, but only done is capable of receiving a reward; none of the other actions can get any reward. So from what I understand, these nodes have been added to the tree, but the reward is not being reflected in the tree for some reason. (Ideally, the path would never stop at depth 3 if done was not invoked, and when done is invoked it should have updated done's node with its reward while the rest of the actions have no reward.)
Am I missing something about how and when the argmax is computed on the tree?
What's your planning horizon? i.e. depth
The maximum depth is 20 (default). The actual plan is only 2 steps (it needs to do a move action and then do this rotate action)
Are you sure the reward function is yielding high reward on done?
Yes done gives the reward in this case.
I'm just wondering: in the case where done does not give a reward, this path should not be the best path, and consequently a different path where done does give the reward should be picked, right?
Or is it possible that there are different paths from each of the nodes that give a reward, and done returns 0, so this path is picked? I'm not able to see a case where this path would be picked if done does not return a high reward.
I see that you only run with 300 simulations. Another possibility is 300 is not enough for the tree to be built beyond the done step. What happens if you try a larger number, 1000?
I tried it with 2000 simulations and it's definitely better.
I would assume there should be nothing after the done step? Because once done is called, there is no more reward to be attained (even the reward function checks if the robot is done and returns nothing), so I'm just wondering how building the tree beyond done helps in this case.
My guess is without enough simulations, the tree hasn’t even expanded upon “done” yet.
What’s your branching factor? Are you able to prune some actions? Also, consider ActionPrior which enables a custom rollout policy that could make planning more efficient.
My branching factor is about 15.
I am already using an action prior, which picks a subset of these 15 actions (generally 2 or 3) and chooses randomly amongst them. Without it, planning was never successful as the branching factor is simply too big.
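Roughly, my prior does something like this (an illustrative sketch, not my exact code, and not pomdp_py's real ActionPrior class; pomdp_py's version returns (action, num_visits_init, value_init) tuples used to seed new tree nodes):

```python
import random

# Illustrative sketch of a subset-style action prior: from ~15 candidate
# actions it returns a small random subset, each paired with an initial
# visit count and value used to seed the corresponding new tree nodes.
class SubsetActionPrior:
    def __init__(self, all_actions, subset_size=3,
                 num_visits_init=1, value_init=0.0):
        self.all_actions = list(all_actions)
        self.subset_size = subset_size
        self.num_visits_init = num_visits_init
        self.value_init = value_init

    def get_preferred_actions(self, state, history):
        # A real prior would condition on state/history; this one is random.
        chosen = random.sample(self.all_actions, self.subset_size)
        return {(a, self.num_visits_init, self.value_init) for a in chosen}
```

Note that when every preferred action gets the same value_init, the initial values alone cannot differentiate them; only backed-up rewards can.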
What exactly do you mean by expanding upon "done"? I understand that for most actions you need to expand on them and see the reward you would get by taking that action and some actions after that. But for done, you get the reward right there and there is no more reward to be found. So, does it matter if "done" has been expanded upon?
Actually this implementation doesn’t stop expanding on a node. You probably want all rewards of future states to be 0 after done is taken.
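Something like this, where the reward function treats done as absorbing (a minimal sketch with made-up names, not your actual model):

```python
def reward(already_done, action, done_reward=100.0, step_cost=-1.0):
    """Sketch of an absorbing-on-done reward: `already_done` is whether
    the robot had signaled done before taking `action`. Once done, every
    future transition yields 0, so expanding the tree past a done node
    cannot add spurious value to that branch."""
    if already_done:
        return 0.0           # absorbing: no reward after done
    if action == "done":
        return done_reward   # terminal reward for finishing
    return step_cost         # small cost for every other action
```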
I meant that when the node is first created, visit count is 0. That’s what you had above.
Ok yea that makes sense. While I did see improvements with a larger number of simulations, this part remained the same (the number of visits to the done action is still 1/0). I will look into it more. Thank you very much
Hey I just had a couple more questions about the TreeDebugger:
When running the tree debugger and trying to print dd.mbp, I see that an Observation appears in root.children - https://github.com/h2r/pomdp-py/blob/7eeb6538f25eb8597e3c8a643906e525141a62ed/pomdp_py/utils/debugging.py#L648 In this for loop, when I run the debugger, I see that sometimes c is an action (which is expected) and sometimes c is an observation (I'm unsure why this is the case)
Outside of the maximum depth, what dictates the depth at which the mbp terminates? I know that the rollout goes to depth 20, so I was expecting the tree to go to depth 20 as well, which should ideally have been reflected in the mbp, but it stops at 2/3 (in the cases when done is never called but it is still the best path to take)
- Not sure either. In that case, root shouldn't be a VNode. It would be good if you could provide an example
- This is expected. As mentioned in another thread, rollouts run till max depth, but the tree expands much more slowly -- the tree stopping at depth 2/3 is most likely because the number of simulations wasn't enough to build it till max depth. Imagine you only have one simulation: do you expect the tree to grow till max depth? No, it only expands one level deeper from the root node.
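To see why, consider a toy model of the growth (illustrative only, not the actual pomdp_py planner; it descends to the least-visited child instead of using UCB, which if anything understates real growth): each simulation walks through fully expanded nodes and then adds exactly one new node at the frontier, so with a branching factor of 15 the tree barely gets past depth 2-3 within 300 simulations.

```python
# Toy model of MCTS tree growth: each simulation descends through
# fully-expanded nodes (picking the least-visited child for
# determinism) and then adds exactly one new child, so depth grows
# very slowly when the branching factor is large.
def grow_tree(num_sims, branching=15):
    def new_node():
        return {"children": {}, "visits": 0}

    root = new_node()
    max_depth = 0
    for _ in range(num_sims):
        node, depth = root, 0
        # descend while the current node already has all its children
        while len(node["children"]) == branching:
            node = min(node["children"].values(),
                       key=lambda c: c["visits"])
            node["visits"] += 1
            depth += 1
        node["children"][len(node["children"])] = new_node()  # expand one node
        max_depth = max(max_depth, depth + 1)
    return max_depth
```

In this toy model, 300 simulations with branching factor 15 only reach depth 3, even if max_depth would allow 20.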
1. Example - so I set a breakpoint in the dd.mbp function:
Here is the output before I run the loop at Line 647:
for c2 in root.children:
    print(c2)
move(7->4)
move(7->2)
move(7->5)
move(7->0)
move(7->3)
Here is the output after I run the loop:
for c2 in root.children:
    print(c2)
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:)
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(15, 6))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(15, 5))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(15, 7))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(14, 7))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(16, 5))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(14, 6))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(16, 6))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(14, 5))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(16, 7))
I'm not sure why the loop at Line 647 changes the contents of root.
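For what it's worth, my mental model of the tree is the usual POMCP-style alternation, roughly this (a plain-Python sketch, not pomdp_py's actual classes):

```python
# Sketch of the VNode/QNode alternation in a POMCP-style tree:
# a VNode's children are keyed by actions and map to QNodes; a QNode's
# children are keyed by observations and map to VNodes. So iterating
# `.children` prints actions or observations depending on which node
# type the variable currently holds.
class QNode:
    def __init__(self):
        self.children = {}  # observation -> VNode

class VNode:
    def __init__(self):
        self.children = {}  # action -> QNode
```

So if root were reassigned from a VNode to one of its QNode children, the same print loop would switch from showing actions to showing observations.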
I was debugging the best future actions and paths using the Tree Debugger.
When printing dd.mbp - the expected output is something of the sort (from https://h2r.github.io/pomdp-py/html/api/pomdp_py.utils.debugging.html ):
What does the empty list in front of the listen action signify? When I do my dd.mbp, there's a set of actions in that list, but I'm unable to understand what exactly they mean (and mine appear after a done action)
dd.mbp :