prost-planner / prost

probabilistic planning system for tasks encoded in RDDL
MIT License

Wrong description of Monte Carlo backup functions in THTS paper? #111

Closed GCrispino closed 4 years ago

GCrispino commented 4 years ago

Hi!

I don't know if I'm just not following it correctly, but the description of the Monte Carlo backup function in the THTS ICAPS 2013 paper does not seem to be aligned with the code. There, the backup for a decision node (V(n_d)) is described as averaging over the action-value estimates of its child chance nodes (Q(n_c)).

However, there is only one backup function for decision nodes in backup_function.cc, BackupFunction::backupDecisionNode (aside from BackupFunction::backupDecisionNodeLeaf), and it does not average over the child chance nodes. Instead, it returns the value of the maximizing chance node, which looks like the MaxMC definition in the same paper.

So it looks like the MaxMC backup in the code is aligned with the paper, but the regular Monte Carlo backup is not. Am I missing something here, or is this really the case?
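
To make sure I'm reading the two definitions the same way, here is a minimal sketch of the difference as I understand it (hypothetical node structs and names, not the actual Prost classes):

```cpp
// Minimal sketch, not the actual Prost code: the two decision-node backups
// as I understand them. ChanceNode and its fields are hypothetical.
#include <algorithm>
#include <limits>
#include <vector>

struct ChanceNode {
    double qValue = 0.0;  // Q(n_c): action-value estimate of the child
    int visits = 0;       // number of trials through this child
};

// Paper's MC backup: V(n_d) is the visit-weighted average over the children.
double mcBackupAsInPaper(std::vector<ChanceNode> const& children) {
    double sum = 0.0;
    int total = 0;
    for (ChanceNode const& c : children) {
        sum += c.visits * c.qValue;
        total += c.visits;
    }
    return total > 0 ? sum / total : 0.0;
}

// What backup_function.cc seems to do (and what the paper calls MaxMC):
// V(n_d) is the value of the maximizing child chance node.
double mcBackupAsImplemented(std::vector<ChanceNode> const& children) {
    double best = -std::numeric_limits<double>::infinity();
    for (ChanceNode const& c : children) {
        if (c.visits > 0) {
            best = std::max(best, c.qValue);
        }
    }
    return best;
}
```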

thomaskeller79 commented 4 years ago

Hi Gabriel,

It is correct that the implementation of MC backups differs from the description in the paper. If you look at the implementation of MCBackupFunction::backupChanceNode, you will notice that it also differs from the one given in the paper: we use the futReward (i.e., the reward yielded in the current trial after the current node) that is passed to the function instead of computing the weighted average over all successor nodes as described in the THTS paper. However, this produces values equivalent to those of the backup function described in the paper.
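
Roughly (and with hypothetical member names; futReward is the only identifier taken from the code), the update amounts to maintaining a running mean of the sampled returns:

```cpp
// Sketch of the futReward-based chance node backup described above.
// Member names other than futReward are hypothetical, not the Prost code.
struct ChanceNodeEstimate {
    double qValue = 0.0;  // running mean of the sampled returns, Q^k(n_c)
    int visits = 0;       // number of trials that passed through this node

    // Instead of recomputing the visit-weighted average over all successor
    // nodes (as in the THTS paper), fold the return collected after this
    // node in the current trial into a running mean.
    void backupChanceNode(double futReward) {
        ++visits;
        qValue += (futReward - qValue) / visits;
    }
};
```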

The estimate for a decision node is obviously different from the one described in the paper, as we do not compute weighted averages but set it to the maximum over its child nodes. The influence of the decision node values on the behavior is comparatively small, though, as most decisions are based on the chance node estimates alone (i.e., the Q-value estimates for state-action pairs). Decision node estimates are used only in one place: to determine the value of the bias parameter B in the UCB1 formula, where we now use max_{n_c \in S(n_d)} Q^k(n_c) instead of V(n_d). However, this is actually an improved estimate of B, as it is closer to V*(n_d), which is the value that would be used if it were available.
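
For reference (and modulo the exact constants used in the implementation), the UCB1 selection rule in question is roughly of the form

argmax_{n_c \in S(n_d)} [ Q^k(n_c) + B * sqrt( ln C^k(n_d) / C^k(n_c) ) ],

where C^k(.) denotes visit counts and B only scales the exploration term.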

tl;dr: the backup function is equivalent for the (more important) chance nodes, and the bias parameter estimate is better in the implementation.

GCrispino commented 4 years ago

First, thanks for answering!

It is correct that the implementation of MC backups differs from the description in the paper. If you look at the implementation of MCBackupFunction::backupChanceNode, you will notice that it also differs from the one given in the paper: we use the futReward (i.e., the reward yielded in the current trial after the current node) that is passed to the function instead of computing the weighted average over all successor nodes as described in the THTS paper. However, this produces values equivalent to those of the backup function described in the paper.

I did notice! That was another question I had.

Also, if I may ask: why didn't you use this future-reward approach for backing up nodes from the start (and in the paper's description)? I ask because it seems to be the "standard" way of updating nodes after rollouts in UCT implementations (update the current average estimate with the last cumulative reward from the node). So why use a different approach, even if only in the paper?

Decision node estimates are used only in one place: to determine the value of the bias parameter B in the UCB1 formula, where we now use max_{n_c \in S(n_d)} Q^k(n_c) instead of V(n_d).

I didn't really get this. Isn't V(n_d) now equal to max_{n_c \in S(n_d)} Q^k(n_c) (the decision node's estimate is the maximum of its child nodes' estimates)?

thomaskeller79 commented 4 years ago

Also, if I may ask: why didn't you use this future-reward approach for backing up nodes from the start (and in the paper's description)? I ask because it seems to be the "standard" way of updating nodes after rollouts in UCT implementations (update the current average estimate with the last cumulative reward from the node). So why use a different approach, even if only in the paper?

Mostly for historical reasons: in earlier versions of Prost, it was implemented as in the paper. However, I also found it interesting back then that there is actually a dynamic-programming-like backup that is equivalent to the "standard" way of updating the nodes.
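
To sketch why the two coincide (under the assumption that each successor's estimate is itself the mean of the returns sampled through it): the visit-weighted average

sum_{n' \in S(n)} (C^k(n') / C^k(n)) * V^k(n')

then equals the mean of all C^k(n) returns sampled through n, which is exactly the quantity that the incremental futReward update maintains.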

I didn't really get this. Isn't V(n_d) now equal to max_{n_c \in S(n_d)} Q^k(n_c) (the decision node's estimate is the maximum of its child nodes' estimates)?

Sorry for the confusion. Of course both expressions are now equal; what I meant is that we use the max formula rather than the weighted average used in the paper (the implementation still uses V(n_d); the change is only that V(n_d) now means something different).

GCrispino commented 4 years ago

I see! This is interesting indeed. Thanks for answering!