We have to adapt the following search components of the planner:

- THTS:
  - `visitDecisionNode` has to be adapted for partial actions
- Initializer:
  - we have to discuss how to implement `SingleChildInitializer`
- IDS and MLH:
  - possibly adapt `estimateQValues` and the checks for reasonable actions, but perhaps this is not required for now

We have to create a new factored heuristics class and implement the following heuristics:

- potential heuristic

Parser:

Planner:

The following components have been implemented for the paper but will not be merged into the main repository. We may revisit these topics at a later date:
Here is how the current implementation in the development branch works:
There is an interface class for decision trees. Each inner node in a decision tree corresponds to a partial action, while leaves correspond to full action states. Each level of the tree corresponds to the index of an action fluent, so the children of a node correspond to the possible assignments of that action fluent. A decision tree has a currently active node, which can be updated by passing a state; this sets the node's children to applicable or inapplicable, depending on whether there are applicable actions consistent with the corresponding partial action.

Right now there is only one implementation of a decision tree: the grounded action tree constructs the complete tree beforehand by iterating through the set of action states. Therefore, the number of leaves in the initial tree equals the number of actions.
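To make the interface concrete, here is a minimal C++ sketch along the lines of the description above; all type and member names are hypothetical and do not reflect the actual code in the development branch:

```cpp
#include <memory>
#include <vector>

// Assumed stand-ins for the planner's types; illustrative only.
struct ActionState {
    std::vector<int> values;  // one value per action fluent
    int index = -1;           // position among all action states
};
struct State {};

struct DecisionTreeNode {
    int actionFluentIndex = -1;  // action fluent assigned at this level
    // One child per value of the action fluent; empty for leaves.
    std::vector<std::unique_ptr<DecisionTreeNode>> children;
    // Set by DecisionTree::update(state): does an applicable action exist
    // that is consistent with the partial action this child represents?
    bool applicable = false;
    // Leaves store the full action state they correspond to.
    ActionState const* actionState = nullptr;

    bool isLeaf() const { return children.empty(); }
};

class DecisionTree {
public:
    virtual ~DecisionTree() = default;
    // Mark the children of the currently active node as applicable or
    // inapplicable, depending on whether applicable actions consistent
    // with the corresponding partial action exist in 'state'.
    virtual void update(State const& state) = 0;
    // Descend to the child reached by assigning 'value' to the action
    // fluent of the current level; this moves the currently active node.
    virtual void selectChild(int value) = 0;
    // Move the currently active node back to the root (trial reset).
    virtual void reset() = 0;
    virtual DecisionTreeNode const* activeNode() const = 0;
};

// The grounded action tree builds the complete tree beforehand by iterating
// over all action states, so the initial tree has one leaf per action.
class GroundedActionTree : public DecisionTree { /* construction omitted */ };
```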
The tree is used by the following components:
THTS: The tree is constructed initially, and each trial reset moves the currently active tree node back to the root. A decision node now corresponds to an action fluent assignment: when the action selection component selects a child node, the tree is updated with the corresponding value assignment, which moves the currently active node. Once a leaf node is reached, the corresponding action state is used to compute the successor state and the reward. Regarding statistics, only decision nodes corresponding to leaf nodes are counted towards the number of initialized decision nodes. Conceptually, the root decision node still corresponds to a state where no action fluent has been assigned a value, intermediate decision nodes correspond to partial action assignments, and leaf nodes correspond to full action assignments, as described in the SOCS paper. For this reason, solved nodes are only cached if they are root nodes of the tree (since this implies that all intermediate nodes are solved as well).
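As a rough illustration, the adapted trial step could look like the following sketch, reusing the hypothetical types from above; `SearchNode`, `visitChanceNode`, and the action selection call are assumptions and do not reflect PROST's actual signatures:

```cpp
// Assumes SearchNode, ActionSelection::selectAssignment, and
// visitChanceNode are declared elsewhere; illustrative only.
double THTS::visitDecisionNode(SearchNode* node) {
    DecisionTreeNode const* active = tree->activeNode();
    if (active->isLeaf()) {
        // A leaf fixes a full action state, which is used to compute the
        // reward and the successor state.
        return visitChanceNode(node, *active->actionState);
    }
    // Select a value for the action fluent of the current level ...
    int value = actionSelection->selectAssignment(node);
    // ... and mirror the selection in the tree, which moves the
    // currently active node one level down.
    tree->selectChild(value);
    return visitDecisionNode(node->children[value]);
}
```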
Initialization: This part differs from the description in the paper, as well as from the original paper implementation, for design and efficiency reasons. As before, the initialization method is responsible for initializing new (decision or chance) nodes and giving them an initial value estimate based on the underlying heuristic. This also holds for the factored representation, but to compute valid successor nodes (i.e., action fluent assignments consistent with the current state), the initializer updates the decision tree with the current state, which updates the applicability status of the children of the currently active tree node. Only applicable children are initialized; if a child is a leaf node, a chance node is created, otherwise a decision node is created.
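A corresponding sketch of the initialization step under the same assumptions (`createChanceNode` and `createDecisionNode` are hypothetical helpers):

```cpp
// Assumes SearchNode and the node-creation helpers are declared elsewhere.
void Initializer::initialize(SearchNode* node, State const& state) {
    // Refresh the applicability status of the active node's children.
    tree->update(state);
    DecisionTreeNode const* active = tree->activeNode();
    for (size_t value = 0; value < active->children.size(); ++value) {
        DecisionTreeNode const* child = active->children[value].get();
        if (!child || !child->applicable) {
            continue;  // no applicable action is consistent with this value
        }
        // Leaves fix a full action state and become chance nodes; inner
        // nodes represent partial actions and become decision nodes.
        node->children[value] =
            child->isLeaf() ? createChanceNode() : createDecisionNode();
        // Initial value estimates come from the factored heuristic (below).
    }
}
```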
Heuristic: The new factored heuristic class makes it possible to compute heuristic values for partial action states by passing a state and a decision tree to the heuristic. A factored heuristic is initialized with a state-action heuristic that is used to compute h(s,a), where a is a concrete action state. There is currently one implementation of a factored heuristic: the factored max heuristic takes the maximum heuristic value over all action states that are consistent with the partial action, as described in the paper.

The implementation of the factored max heuristic differs from the one used for the paper evaluation. Since the initializer has already updated which action states are consistent with the current partial action, the factored max heuristic computes the qValues of these applicable actions with the underlying state-action heuristic (`estimateQValues`). Then, for each action fluent assignment of the currently active node, the heuristic value is the maximum qValue over all applicable action states consistent with this assignment. The main difference in the implementation is how consistency is computed: the tree already guarantees that the applicable actions are consistent with the currently active node (partial action), so it only has to be checked whether an action state is consistent with the current action fluent assignment (i.e., the child node in the tree). This is a simple equality check between the action fluent assignment and the corresponding entry in the action state. If the action fluent assignment leads to a leaf in the tree, we can be even more efficient: in this case there is only a single consistent action state, namely the one corresponding to the leaf node. Leaf nodes in the tree store their action state, so we can immediately return the qValue of the corresponding action state without iterating through inconsistent action states.
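Finally, a sketch of the factored max heuristic along these lines; `estimateQValues` exists as named in the text, while everything else (the class name, indexing qValues by `ActionState::index`, the `applicableActions` helper) is an assumption for illustration:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// 'result' holds one heuristic value per child of the active tree node.
void FactoredMaxHeuristic::estimateValues(State const& state,
                                          DecisionTree const& tree,
                                          std::vector<double>& result) {
    DecisionTreeNode const* active = tree.activeNode();
    // h(s,a) for all applicable action states, computed once by the
    // underlying state-action heuristic; indexed by ActionState::index.
    std::vector<ActionState const*> applicable = applicableActions(state);
    std::vector<double> qValues;
    baseHeuristic->estimateQValues(state, applicable, qValues);

    for (size_t value = 0; value < active->children.size(); ++value) {
        DecisionTreeNode const* child = active->children[value].get();
        if (!child || !child->applicable) {
            continue;
        }
        if (child->isLeaf()) {
            // Exactly one consistent action state, stored in the leaf:
            // return its qValue without iterating over any actions.
            result[value] = qValues[child->actionState->index];
        } else {
            // Consistency with the active node (partial action) is already
            // guaranteed by the tree, so a single equality check on the
            // current action fluent suffices.
            double best = -std::numeric_limits<double>::infinity();
            for (ActionState const* a : applicable) {
                if (a->values[active->actionFluentIndex] ==
                    static_cast<int>(value)) {
                    best = std::max(best, qValues[a->index]);
                }
            }
            result[value] = best;
        }
    }
}
```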
We have not yet implemented FDR action fluents and mutex invariants, so the current implementation is still less efficient than the flattened representation for problems without concurrency. For example, the first step in elevators 1 has ~140k trials in the factored representation and ~240k trials in the flattened representation. To give an impression of the current performance on problems with concurrency, here is a comparison for the first instance of academic advising (2018) between the master branch (flattened representation) and the current factored action implementation:
Factored representation:

```
THTS round statistics:
Entries in probabilistic state value cache: 45741
Buckets in probabilistic state value cache: 62233
Entries in probabilistic applicable actions cache: 31924
Buckets in probabilistic applicable actions cache: 520241
Number of remaining steps in first solved state: 9
Expected reward in first solved state: 0.000000
Number of trials in initial state: 32383
Number of search nodes in initial state: 459992
Number of reward lock states: 0
Number of states with only one applicable action: 0
UCB1 action selection round statistics:
Percentage exploration in initial state: 0.212828
--------------------------------------------
>>> END OF SESSION -- TOTAL REWARD: -1575.000000
>>> END OF SESSION -- AVERAGE REWARD: -52.500000
PROST complete running time: 423.623481
```
Flattened representation:

```
THTS round statistics:
Entries in probabilistic state value cache: 2339
Buckets in probabilistic state value cache: 62233
Entries in probabilistic applicable actions cache: 24952
Buckets in probabilistic applicable actions cache: 520241
Number of remaining steps in first solved state: 12
Number of trials in initial state: 2417
Number of search nodes in initial state: 38669
Number of reward lock states: 11
Number of states with only one applicable action: 0
UCB1 action selection round statistics:
Percentage exploration in initial state: 0.742242
THTS heuristic IDS round statistics:
Entries in deterministic state value cache: 1283982
Buckets in deterministic state value cache: 2144977
Entries in deterministic applicable actions cache: 841584
Buckets in deterministic applicable actions cache: 1056323
Entries in IDS reward cache: 24948
Buckets in IDS reward cache: 520241
Average search depth in initial state: 4.941176
Total number of runs: 2200
Total average search depth: 3.680909
--------------------------------------------
>>> END OF SESSION -- TOTAL REWARD: -1630.000000
>>> END OF SESSION -- AVERAGE REWARD: -54.333333
PROST complete running time: 325.837064
```
In our recent SOCS paper we describe how decision nodes in THTS can be replaced by decision trees to better support factored action spaces. In this issue we refactor the implementation and merge it into the master branch.