In the `_simulate` function of `po_uct.pyx`: both the value of `root` (VNode) and the value of `root[action]` (QNode) are updated based on `total_reward`. However, the algorithm in the paper only requires updating the value of the QNode, i.e. `root[action]`.
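For reference, here is a minimal sketch of the backup step as described in the POMCP paper (Silver & Veness, 2010). The attribute names (`num_visits`, `value`) are illustrative, not copied from the pomdp_py API: visit counts are incremented on both nodes, but only the QNode keeps a value estimate.

```python
def backup(vnode, qnode, total_reward):
    """Backup step per the paper's SIMULATE procedure (sketch)."""
    vnode.num_visits += 1   # N(h)  <- N(h) + 1
    qnode.num_visits += 1   # N(ha) <- N(ha) + 1
    # Incremental mean: Q(ha) <- Q(ha) + (R - Q(ha)) / N(ha)
    qnode.value += (total_reward - qnode.value) / qnode.num_visits
    # Note: no vnode.value update -- the paper maintains no value
    # estimate on the history (V) node itself.
```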
I also noticed that in the original author's source code, the expected discounted cumulative value is not maintained on both the VNode and the QNode either.
Also, in the current POUCT implementation in pomdp_py, commenting out `root.value = ...` and updating only the QNode's value, as the paper prescribes, does not change the planner's output behavior: the planner ultimately selects an action based on the values of the QNodes that are immediate children of the root node. So we should remove this redundant line, because it causes confusion.
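To illustrate why `root.value` is never consulted at decision time, here is a hedged sketch of the final greedy action selection over the root's QNode children (again with assumed attribute names, not the actual pomdp_py API):

```python
def greedy_action(root):
    """Pick the action whose QNode child has the highest value.

    Only the children's values matter here; root.value itself
    plays no role in the returned action.
    """
    best_action, best_value = None, float("-inf")
    for action, qnode in root.children.items():
        if qnode.value > best_value:
            best_action, best_value = action, qnode.value
    return best_action
```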