Closed · RajeshDM closed this issue 4 months ago
Good question. Actions in the list are equally good (i.e. same value) as the action outside. If the list is empty, then the action outside is the best. Code ref.
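Concretely, the bracketed list works like this (a hypothetical helper for illustration, not the actual debugger code): mbp prints the argmax action at each node, followed by the other child actions whose values are exactly equal to it.

```python
def best_and_ties(children_values):
    """children_values: dict mapping action name -> value at a tree node.
    Returns the argmax action and the list of equally-valued siblings
    (the bracketed list printed next to each action in the output)."""
    best = max(children_values, key=children_values.get)
    ties = [a for a in children_values
            if a != best and children_values[a] == children_values[best]]
    return best, ties
```

An empty list therefore means the chosen action's value is strictly better than every sibling's.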
When it says "just as good", it means the expected value at that node in the tree: all the actions at that point give the same value. But it looks like this is not taking the expected reward into account; it only seems to take into account the initial value assignments (for example, I have a few actions in the action prior that all receive the same initial cost, but I'm hoping the reward differentiates between them). [The move_rotmove_rot action is just a rotation action.]
move_rotmove_rot(330,45)(0->0) []
move_rotmove_rot(360,315)(0->0) []
move_rotmove_rot(30,225)(0->0) []
done [move_rotmove_rot(360,135)(0->0), move_rotmove_rot(30,225)(0->0), move_rotmove_rot(30,270)(0->0), move_rotmove_rot(360,225)(0->0), move_rotmove_rot(30,315)(0->0), move_rotmove_rot(30,135)(0->0), move_rotmove_rot(360,270)(0->0), move_rotmove_rot(30,90)(0->0), move_rotmove_rot(360,90)(0->0), move_rotmove_rot(360,315)(0->0), move_rotmove_rot(360,180)(0->0), move_rotmove_rot(30,180)(0->0), move(0->1)]
_VNodePP(n=299, v=-56.132)(depth=0)
├─── ₁₄move_rotmove_rot(330,45)(0->0)⟶_QNodePP(n=31, v=-56.132)
│ └─── ₀CosObservation(r:(((18, 27, 45), RobotStatus(done=False), {}));o:)⟶_VNodePP(n=30, v=-13.011)(depth=1)
│ ├─── ₁₂move_rotmove_rot(360,315)(0->0)⟶_QNodePP(n=4, v=-13.011)
│ │ └─── ₀CosObservation(r:(((18, 27, 45), RobotStatus(done=False), {}));o:)⟶_VNodePP(n=3, v=72.850)(depth=2)
│ │ ├─── ₄move_rotmove_rot(30,225)(0->0)⟶_QNodePP(n=1, v=72.850)
│ │ │ └─── ₀CosObservation(r:(((18, 27, 45), RobotStatus(done=False), {}));o:)⟶_VNodePP(n=0, v=0.000)(depth=3)
│ │ │ ├─── ₀done⟶_QNodePP(n=0, v=0.000)
So if you see, it basically says all actions are as good as done, but only done is capable of receiving a reward; none of the other actions can get any reward. So from what I understand, these nodes have been added to the tree, but the reward is not being reflected in the tree for some reason. (Ideally, the path would never stop at depth 3 if done was not invoked, and when done is invoked it should have updated done's node with its reward while the rest of the actions have no reward.)
Am I missing something about how and when the argmax is computed on the tree?
What's your planning horizon? i.e. depth
The maximum depth is 20 (default). The actual plan is only 2 steps (it needs to do a move action and then do this rotate action)
Are you sure the reward function is yielding high reward on done?
Yes done gives the reward in this case.
I'm just wondering: in the case where done does not give a reward, this path should not be the best path, and consequently a different path where done does give the reward should be picked, right?
Or is it possible that there are different paths from each of the nodes that give a reward, and done returns 0, so this path is picked? I'm not able to see a case where this path would be picked if done does not return a high reward.
I see that you only run with 300 simulations. Another possibility is 300 is not enough for the tree to be built beyond the done step. What happens if you try a larger number, 1000?
I tried it with 2000 simulations and it's definitely better.
I would assume there should be nothing after the done step? Because once done is called, there is no more reward to be attained (even the reward function checks if the robot is done and returns nothing), so I'm just wondering how building the tree beyond done helps in this case.
My guess is without enough simulations, the tree hasn’t even expanded upon “done” yet.
What’s your branching factor? Are you able to prune some actions? Also, consider ActionPrior which enables a custom rollout policy that could make planning more efficient.
My branching factor is about 15.
I am already using an action prior, which picks a subset of these 15 actions (generally 2 or 3) and chooses randomly amongst them. Without it, planning was never successful as the branching factor is simply too big.
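Roughly, my prior does something like this (an illustrative sketch, not my exact code, and not pomdp_py's real ActionPrior class; pomdp_py's version returns (action, num_visits_init, value_init) tuples used to seed new tree nodes):

```python
import random

# Illustrative sketch of a subset-style action prior: from ~15 candidate
# actions it returns a small random subset, each paired with an initial
# visit count and value used to seed the corresponding new tree nodes.
class SubsetActionPrior:
    def __init__(self, all_actions, subset_size=3,
                 num_visits_init=1, value_init=0.0):
        self.all_actions = list(all_actions)
        self.subset_size = subset_size
        self.num_visits_init = num_visits_init
        self.value_init = value_init

    def get_preferred_actions(self, state, history):
        # A real prior would condition on state/history; this one is random.
        chosen = random.sample(self.all_actions, self.subset_size)
        return {(a, self.num_visits_init, self.value_init) for a in chosen}
```

Note that when every preferred action gets the same value_init, the initial values alone cannot differentiate them; only backed-up rewards can.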
What exactly do you mean by expanding upon "done"? I understand that for most actions you need to expand on them and see the reward you would get by taking that action and some actions after that. But for done, you get the reward right there and there is no more reward to be found. So, does it matter if "done" has been expanded upon?
Actually this implementation doesn’t stop expanding on a node. You probably want all rewards of future states to be 0 after done is taken.
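Something like this, where the reward function treats done as absorbing (a minimal sketch with made-up names, not your actual model):

```python
def reward(already_done, action, done_reward=100.0, step_cost=-1.0):
    """Sketch of an absorbing-on-done reward: `already_done` is whether
    the robot had signaled done before taking `action`. Once done, every
    future transition yields 0, so expanding the tree past a done node
    cannot add spurious value to that branch."""
    if already_done:
        return 0.0           # absorbing: no reward after done
    if action == "done":
        return done_reward   # terminal reward for finishing
    return step_cost         # small cost for every other action
```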
I meant that when the node is first created, visit count is 0. That’s what you had above.
Ok yea that makes sense. While I did see improvements with a larger number of simulations, this part remained the same (the number of visits to the done action is still 1/0). I will look into it more. Thank you very much
Hey I just had a couple more questions about the TreeDebugger:
When running the tree debugger and trying to print dd.mbp, I see that an Observation appears in root.children - https://github.com/h2r/pomdp-py/blob/7eeb6538f25eb8597e3c8a643906e525141a62ed/pomdp_py/utils/debugging.py#L648 In this for loop, when I run the debugger, I see that sometimes c is an action (which is expected) and sometimes c is an observation (I'm unsure why this is the case)
Outside of the maximum depth, what dictates the depth at which the mbp terminates? I know that the rollout goes to depth 20, so I was expecting the tree to go to depth 20 as well, which should ideally have been reflected in the mbp, but it stops at 2/3 (in the cases when done is never called but it is still the best path to take)
- Not sure either. In that case, root shouldn't be a VNode. It would be good if you could provide an example
- This is expected. As mentioned in another thread, rollouts run till max depth, but the tree expands much more slowly -- the tree stopping at depth 2/3 is most likely because the number of simulations wasn't enough to build it till max depth. Imagine you only have one simulation: do you expect the tree to grow till max depth? No, it only expands one level deeper from the root node.
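To see why, consider a toy model of the growth (illustrative only, not the actual pomdp_py planner; it descends to the least-visited child instead of using UCB, which if anything understates real growth): each simulation walks through fully expanded nodes and then adds exactly one new node at the frontier, so with a branching factor of 15 the tree barely gets past depth 2-3 within 300 simulations.

```python
# Toy model of MCTS tree growth: each simulation descends through
# fully-expanded nodes (picking the least-visited child for
# determinism) and then adds exactly one new child, so depth grows
# very slowly when the branching factor is large.
def grow_tree(num_sims, branching=15):
    def new_node():
        return {"children": {}, "visits": 0}

    root = new_node()
    max_depth = 0
    for _ in range(num_sims):
        node, depth = root, 0
        # descend while the current node already has all its children
        while len(node["children"]) == branching:
            node = min(node["children"].values(),
                       key=lambda c: c["visits"])
            node["visits"] += 1
            depth += 1
        node["children"][len(node["children"])] = new_node()  # expand one node
        max_depth = max(max_depth, depth + 1)
    return max_depth
```

In this toy model, 300 simulations with branching factor 15 only reach depth 3, even if max_depth would allow 20.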
1. Example - so I set a breakpoint in the dd.mbp function:
Here is the output before I run the loop at Line 647:
for c2 in root.children:
    print(c2)
move(7->4)
move(7->2)
move(7->5)
move(7->0)
move(7->3)
Here is the output after I run the loop:
for c2 in root.children:
    print(c2)
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:)
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(15, 6))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(15, 5))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(15, 7))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(14, 7))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(16, 5))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(14, 6))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(16, 6))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(14, 5))
MosObservation(r:(((13, 8, 0.0), RobotStatus(done=False), {}, False, []));o:AlarmClock(16, 7))
I'm not sure why the loop at Line 647 changes the contents of root.
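For what it's worth, my mental model of the tree is the usual POMCP-style alternation, roughly this (a plain-Python sketch, not pomdp_py's actual classes):

```python
# Sketch of the VNode/QNode alternation in a POMCP-style tree:
# a VNode's children are keyed by actions and map to QNodes; a QNode's
# children are keyed by observations and map to VNodes. So iterating
# `.children` prints actions or observations depending on which node
# type the variable currently holds.
class QNode:
    def __init__(self):
        self.children = {}  # observation -> VNode

class VNode:
    def __init__(self):
        self.children = {}  # action -> QNode
```

So if root were reassigned from a VNode to one of its QNode children, the same print loop would switch from showing actions to showing observations.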
I was debugging the best future actions and paths using the Tree Debugger.
When printing dd.mbp - the expected output is something of the sort (from https://h2r.github.io/pomdp-py/html/api/pomdp_py.utils.debugging.html ):
What does the empty list in front of the listen action signify? When I do my dd.mbp, there's a set of actions in that list, but I'm unable to understand what exactly they mean (and mine appear after a done action)
dd.mbp :