In the `_simulate` function of `po_uct.pyx`: both the value of `root` (VNode) and the value of `root[action]` (QNode) are updated based on `total_reward`. However, the algorithm in the paper only requires updating the value of the QNode, i.e. `root[action]`.
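For reference, here is a minimal sketch of the backup step as described in the POMCP paper (Silver & Veness, 2010). The attribute names (`num_visits`, `value`) are illustrative, not copied from the pomdp_py API: visit counts are incremented on both nodes, but only the QNode keeps a value estimate.

```python
def backup(vnode, qnode, total_reward):
    """Backup step per the paper's SIMULATE procedure (sketch)."""
    vnode.num_visits += 1   # N(h)  <- N(h) + 1
    qnode.num_visits += 1   # N(ha) <- N(ha) + 1
    # Incremental mean: Q(ha) <- Q(ha) + (R - Q(ha)) / N(ha)
    qnode.value += (total_reward - qnode.value) / qnode.num_visits
    # Note: no vnode.value update -- the paper maintains no value
    # estimate on the history (V) node itself.
```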
I also noticed that in the original author's source code, the expected discounted cumulative value is not maintained on both the VNode and the QNode either.
Also, in the current POUCT implementation in pomdp_py, commenting out `root.value = ...` and updating only the QNode's value, as the paper prescribes, does not change the planner's output behavior: the planner ultimately selects an action based on the values of the QNodes that are immediate children of the root node. So we should remove this redundant line, because it causes confusion.
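To illustrate why `root.value` is never consulted at decision time, here is a hedged sketch of the final greedy action selection over the root's QNode children (again with assumed attribute names, not the actual pomdp_py API):

```python
def greedy_action(root):
    """Pick the action whose QNode child has the highest value.

    Only the children's values matter here; root.value itself
    plays no role in the returned action.
    """
    best_action, best_value = None, float("-inf")
    for action, qnode in root.children.items():
        if qnode.value > best_value:
            best_action, best_value = action, qnode.value
    return best_action
```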