We have to adapt the following search components of the planner:

- THTS:
  - `visitDecisionNode` has to be adapted for partial actions
- Initializer:
  - we have to discuss how to implement `SingleChildInitializer`
- IDS and MLH:
  - possibly adapt `estimateQValues` and the checks for reasonable actions, but perhaps this is not required for now

We have to create a new factored heuristics class and implement the following heuristics:

- potential heuristic

Parser:

Planner:

The following components have been implemented for the paper but will not be merged into the main repository. We may revisit these topics at a later date:
Here is how the current implementation in the development branch works:
There is an interface class for decision trees. Each inner node in a decision tree corresponds to a partial action, while leaves correspond to full action states. Each level of the tree corresponds to the index of an action fluent, so the children of a node correspond to the possible assignments of that action fluent. A decision tree has a currently active node, which can be updated by passing a state; this sets the node's children to applicable or inapplicable, depending on whether there are applicable actions consistent with the corresponding partial action.

Right now there is only one implementation of a decision tree: the grounded action tree constructs the complete tree beforehand by iterating through the set of action states. Therefore, the number of leaves in the initial tree equals the number of actions.
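To make the interface concrete, here is a minimal C++ sketch along the lines of the description above; all type and member names are hypothetical and do not reflect the actual code in the development branch:

```cpp
#include <memory>
#include <vector>

// Assumed stand-ins for the planner's types; illustrative only.
struct ActionState {
    std::vector<int> values;  // one value per action fluent
    int index = -1;           // position among all action states
};
struct State {};

struct DecisionTreeNode {
    int actionFluentIndex = -1;  // action fluent assigned at this level
    // One child per value of the action fluent; empty for leaves.
    std::vector<std::unique_ptr<DecisionTreeNode>> children;
    // Set by DecisionTree::update(state): does an applicable action exist
    // that is consistent with the partial action this child represents?
    bool applicable = false;
    // Leaves store the full action state they correspond to.
    ActionState const* actionState = nullptr;

    bool isLeaf() const { return children.empty(); }
};

class DecisionTree {
public:
    virtual ~DecisionTree() = default;
    // Mark the children of the currently active node as applicable or
    // inapplicable, depending on whether applicable actions consistent
    // with the corresponding partial action exist in 'state'.
    virtual void update(State const& state) = 0;
    // Descend to the child reached by assigning 'value' to the action
    // fluent of the current level; this moves the currently active node.
    virtual void selectChild(int value) = 0;
    // Move the currently active node back to the root (trial reset).
    virtual void reset() = 0;
    virtual DecisionTreeNode const* activeNode() const = 0;
};

// The grounded action tree builds the complete tree beforehand by iterating
// over all action states, so the initial tree has one leaf per action.
class GroundedActionTree : public DecisionTree { /* construction omitted */ };
```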
The tree is used by the following components:
THTS: The tree is constructed initially, and each trial reset moves the currently active tree node back to the root. A decision node now corresponds to an action fluent assignment: when the action selection component selects a child node, the tree is updated with the corresponding value assignment, which moves the currently active node. Once a leaf node is reached, the corresponding action state is used to compute the successor state and the reward. Regarding statistics, only decision nodes corresponding to leaf nodes are counted towards the number of initialized decision nodes. Conceptually, the root decision node still corresponds to a state where no action fluent has been assigned a value, intermediate decision nodes correspond to partial action assignments, and leaf nodes correspond to full action assignments, as described in the SOCS paper. For this reason, solved nodes are only cached if they are root nodes of the tree (since this implies that all intermediate nodes are solved as well).
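As a rough illustration, the adapted trial step could look like the following sketch, reusing the hypothetical types from above; `SearchNode`, `visitChanceNode`, and the action selection call are assumptions and do not reflect PROST's actual signatures:

```cpp
// Assumes SearchNode, ActionSelection::selectAssignment, and
// visitChanceNode are declared elsewhere; illustrative only.
double THTS::visitDecisionNode(SearchNode* node) {
    DecisionTreeNode const* active = tree->activeNode();
    if (active->isLeaf()) {
        // A leaf fixes a full action state, which is used to compute the
        // reward and the successor state.
        return visitChanceNode(node, *active->actionState);
    }
    // Select a value for the action fluent of the current level ...
    int value = actionSelection->selectAssignment(node);
    // ... and mirror the selection in the tree, which moves the
    // currently active node one level down.
    tree->selectChild(value);
    return visitDecisionNode(node->children[value]);
}
```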
Initialization: This part differs from the description in the paper, as well as from the original paper implementation, for design and efficiency reasons. As before, the initialization method is responsible for initializing new (decision or chance) nodes and giving them an initial value estimate based on the underlying heuristic. This also holds for the factored representation, but to compute valid successor nodes (i.e., action fluent assignments consistent with the current state), the initializer updates the decision tree with the current state, which updates the applicability status of the children of the currently active tree node. Only applicable children are initialized; if a child is a leaf node, a chance node is created, otherwise a decision node is created.
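A corresponding sketch of the initialization step under the same assumptions (`createChanceNode` and `createDecisionNode` are hypothetical helpers):

```cpp
// Assumes SearchNode and the node-creation helpers are declared elsewhere.
void Initializer::initialize(SearchNode* node, State const& state) {
    // Refresh the applicability status of the active node's children.
    tree->update(state);
    DecisionTreeNode const* active = tree->activeNode();
    for (size_t value = 0; value < active->children.size(); ++value) {
        DecisionTreeNode const* child = active->children[value].get();
        if (!child || !child->applicable) {
            continue;  // no applicable action is consistent with this value
        }
        // Leaves fix a full action state and become chance nodes; inner
        // nodes represent partial actions and become decision nodes.
        node->children[value] =
            child->isLeaf() ? createChanceNode() : createDecisionNode();
        // Initial value estimates come from the factored heuristic (below).
    }
}
```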
Heuristic: The new factored heuristic class makes it possible to compute heuristic values for partial action states by passing a state and a decision tree to the heuristic. A factored heuristic is initialized with a state-action heuristic that is used to compute h(s,a), where a is a concrete action state. There is currently one implementation of a factored heuristic: the factored max heuristic takes the maximum heuristic value over all action states that are consistent with the partial action, as described in the paper.

The implementation of the factored max heuristic differs from the one used for the paper evaluation. Since the initializer has already updated which action states are consistent with the current partial action, the factored max heuristic computes the qValues of these applicable actions with the underlying state-action heuristic (`estimateQValues`). Then, for each action fluent assignment of the currently active node, the heuristic value is the maximum qValue over all applicable action states consistent with this assignment. The main difference in the implementation is how consistency is computed: the tree already guarantees that the applicable actions are consistent with the currently active node (partial action), so it only has to be checked whether an action state is consistent with the current action fluent assignment (i.e., the child node in the tree). This is a simple equality check between the action fluent assignment and the corresponding entry in the action state. If the action fluent assignment leads to a leaf in the tree, we can be even more efficient: in this case there is only a single consistent action state, namely the one corresponding to the leaf node. Leaf nodes in the tree store their action state, so we can immediately return the qValue of the corresponding action state without iterating through inconsistent action states.
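Finally, a sketch of the factored max heuristic along these lines; `estimateQValues` exists as named in the text, while everything else (the class name, indexing qValues by `ActionState::index`, the `applicableActions` helper) is an assumption for illustration:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// 'result' holds one heuristic value per child of the active tree node.
void FactoredMaxHeuristic::estimateValues(State const& state,
                                          DecisionTree const& tree,
                                          std::vector<double>& result) {
    DecisionTreeNode const* active = tree.activeNode();
    // h(s,a) for all applicable action states, computed once by the
    // underlying state-action heuristic; indexed by ActionState::index.
    std::vector<ActionState const*> applicable = applicableActions(state);
    std::vector<double> qValues;
    baseHeuristic->estimateQValues(state, applicable, qValues);

    for (size_t value = 0; value < active->children.size(); ++value) {
        DecisionTreeNode const* child = active->children[value].get();
        if (!child || !child->applicable) {
            continue;
        }
        if (child->isLeaf()) {
            // Exactly one consistent action state, stored in the leaf:
            // return its qValue without iterating over any actions.
            result[value] = qValues[child->actionState->index];
        } else {
            // Consistency with the active node (partial action) is already
            // guaranteed by the tree, so a single equality check on the
            // current action fluent suffices.
            double best = -std::numeric_limits<double>::infinity();
            for (ActionState const* a : applicable) {
                if (a->values[active->actionFluentIndex] ==
                    static_cast<int>(value)) {
                    best = std::max(best, qValues[a->index]);
                }
            }
            result[value] = best;
        }
    }
}
```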
We have not yet implemented FDR action fluents and mutex invariants, so the current implementation is still less efficient than the flattened representation for problems without concurrency. For example, the first step in elevators 1 has ~140k trials in the factored representation and ~240k trials in the flattened representation. To give an impression of the current performance on problems with concurrency, here is a comparison for the first instance of academic advising (2018) between the master branch (flattened representation) and the current factored action implementation:
Factored representation:

```
THTS round statistics:
Entries in probabilistic state value cache: 45741
Buckets in probabilistic state value cache: 62233
Entries in probabilistic applicable actions cache: 31924
Buckets in probabilistic applicable actions cache: 520241
Number of remaining steps in first solved state: 9
Expected reward in first solved state: 0.000000
Number of trials in initial state: 32383
Number of search nodes in initial state: 459992
Number of reward lock states: 0
Number of states with only one applicable action: 0
UCB1 action selection round statistics:
Percentage exploration in initial state: 0.212828
--------------------------------------------
>>> END OF SESSION -- TOTAL REWARD: -1575.000000
>>> END OF SESSION -- AVERAGE REWARD: -52.500000
PROST complete running time: 423.623481
```
Flattened representation:

```
THTS round statistics:
Entries in probabilistic state value cache: 2339
Buckets in probabilistic state value cache: 62233
Entries in probabilistic applicable actions cache: 24952
Buckets in probabilistic applicable actions cache: 520241
Number of remaining steps in first solved state: 12
Number of trials in initial state: 2417
Number of search nodes in initial state: 38669
Number of reward lock states: 11
Number of states with only one applicable action: 0
UCB1 action selection round statistics:
Percentage exploration in initial state: 0.742242
THTS heuristic IDS round statistics:
Entries in deterministic state value cache: 1283982
Buckets in deterministic state value cache: 2144977
Entries in deterministic applicable actions cache: 841584
Buckets in deterministic applicable actions cache: 1056323
Entries in IDS reward cache: 24948
Buckets in IDS reward cache: 520241
Average search depth in initial state: 4.941176
Total number of runs: 2200
Total average search depth: 3.680909
--------------------------------------------
>>> END OF SESSION -- TOTAL REWARD: -1630.000000
>>> END OF SESSION -- AVERAGE REWARD: -54.333333
PROST complete running time: 325.837064
```
In our recent SOCS paper we describe how decision nodes in THTS can be replaced by decision trees to better support factored action spaces. In this issue we refactor the implementation and merge it into the master branch.