prost-planner / prost

probabilistic planning system for tasks encoded in RDDL
MIT License

Refactor THTS to support decision trees for factored action spaces #103

Open geisserf opened 4 years ago

geisserf commented 4 years ago

In our recent SOCS paper we describe how decision nodes in THTS can be replaced by decision trees to better support factored action spaces. In this issue we refactor the implementation and merge it into the master branch.

geisserf commented 4 years ago

We have to adapt the following search components of the planner:

THTS:

geisserf commented 4 years ago

The following components have been implemented for the paper but will not be merged into the main repository. We may revisit these topics at a later date:

geisserf commented 4 years ago

Here is how the current implementation in the development branch works:

There is an interface class for decision trees. Each inner node in a decision tree corresponds to a partial action, while leaves correspond to full action states. Each level of the tree corresponds to the index of an action fluent, so the children of a node correspond to the possible assignments of that action fluent. A decision tree has a currently active node; this node can be updated by passing a state, which sets its children to applicable or inapplicable depending on whether any applicable action is consistent with the corresponding partial action.

Right now there is only one implementation of a decision tree: the grounded action tree, which constructs the complete tree beforehand by iterating over the set of action states. The number of leaves in the initial tree therefore corresponds to the number of actions.

The tree is used by the following components:

We have not yet implemented FDR action fluents and mutex invariants, so the current implementation is still less efficient than the flattened representation on problems without concurrency. For example, in the first step of elevators 1, the factored representation performs ~140k trials versus ~240k trials for the flattened representation. To give a peek into the current performance on problems with concurrency, here is a comparison on the first instance of academic advising (2018) between the master branch (flattened representation) and the current factored action implementation:

Factored representation:

THTS round statistics:
  Entries in probabilistic state value cache: 45741
  Buckets in probabilistic state value cache: 62233
  Entries in probabilistic applicable actions cache: 31924
  Buckets in probabilistic applicable actions cache: 520241
  Number of remaining steps in first solved state: 9
  Expected reward in first solved state: 0.000000
  Number of trials in initial state: 32383
  Number of search nodes in initial state: 459992
  Number of reward lock states: 0
  Number of states with only one applicable action: 0
  UCB1 action selection round statistics:
    Percentage exploration in initial state: 0.212828

--------------------------------------------
>>> END OF SESSION  -- TOTAL REWARD: -1575.000000
>>> END OF SESSION  -- AVERAGE REWARD: -52.500000
PROST complete running time: 423.623481

Flattened representation:

THTS round statistics:
  Entries in probabilistic state value cache: 2339
  Buckets in probabilistic state value cache: 62233
  Entries in probabilistic applicable actions cache: 24952
  Buckets in probabilistic applicable actions cache: 520241
  Number of remaining steps in first solved state: 12
  Number of trials in initial state: 2417
  Number of search nodes in initial state: 38669
  Number of reward lock states: 11
  Number of states with only one applicable action: 0
  UCB1 action selection round statistics:
    Percentage exploration in initial state: 0.742242
  THTS heuristic IDS round statistics:
    Entries in deterministic state value cache: 1283982
    Buckets in deterministic state value cache: 2144977
    Entries in deterministic applicable actions cache: 841584
    Buckets in deterministic applicable actions cache: 1056323
    Entries in IDS reward cache: 24948
    Buckets in IDS reward cache: 520241
    Average search depth in initial state: 4.941176
    Total number of runs: 2200
    Total average search depth: 3.680909

--------------------------------------------
>>> END OF SESSION  -- TOTAL REWARD: -1630.000000
>>> END OF SESSION  -- AVERAGE REWARD: -54.333333
PROST complete running time: 325.837064