h2r / pomdp-py

A framework to build and solve POMDP problems. Documentation: https://h2r.github.io/pomdp-py/
MIT License

How to correctly implement a goal/terminal state #38

Closed. Hororohoruru closed this issue 10 months ago

Hororohoruru commented 12 months ago

Hello. When thinking about changing the rollout function as mentioned in #34, I realized that the model should have a terminal state so that the POMCP planner does not keep adding value after a decision is taken within a trial. However, after implementing such a state through the reward and transition models, I found that the POMCP planner now always takes a decision right after the first observation, even though the belief in the corresponding state is really low. Since this was not happening before, I am wondering whether I implemented the goal state incorrectly.

First, this is how the tree looks for a given trial:

TRIAL 0 (true state s_3-t_0)
--------------------

  STEP 0 (6 steps remaining)
  Current belief (based on 0s of data):
        s_0-t_0 -> 0.096
        s_1-t_0 -> 0.075
        s_2-t_0 -> 0.065
        s_3-t_0 -> 0.062
        s_4-t_0 -> 0.079
        s_5-t_0 -> 0.066
        s_6-t_0 -> 0.089
        s_7-t_0 -> 0.1
        s_8-t_0 -> 0.085
        s_9-t_0 -> 0.087
        s_10-t_0 -> 0.099
        s_11-t_0 -> 0.097

TreeDebugger@
_VNodePP(n=14603, v=-26.635)
    - [0] a_0: QNode(n=1, v=-100.000)
    - [1] a_1: QNode(n=7, v=-84.286)
    - [2] a_10: QNode(n=7, v=-84.286)
    - [3] a_11: QNode(n=7, v=-84.286)
    - [4] a_2: QNode(n=1, v=-100.000)
    - [5] a_3: QNode(n=7, v=-84.286)
    - [6] a_4: QNode(n=1, v=-100.000)
    - [7] a_5: QNode(n=1, v=-100.000)
    - [8] a_6: QNode(n=1, v=-100.000)
    - [9] a_7: QNode(n=1, v=-100.000)
    - [10] a_8: QNode(n=1, v=-100.000)
    - [11] a_9: QNode(n=1, v=-100.000)
    - [12] a_wait: QNode(n=14567, v=-26.635)

(Pdb) dd[12]
a_wait⟶_QNodePP(n=14567, v=-26.635)
    - [0] o_0: VNodeParticles(n=1567, v=-24.875)
    - [1] o_1: VNodeParticles(n=880, v=-29.977)
    - [2] o_10: VNodeParticles(n=1942, v=-24.316)
    - [3] o_11: VNodeParticles(n=1496, v=-30.452)
    - [4] o_2: VNodeParticles(n=1201, v=-26.077)
    - [5] o_3: VNodeParticles(n=788, v=-56.256)
    - [6] o_4: VNodeParticles(n=756, v=-36.302)
    - [7] o_5: VNodeParticles(n=1165, v=-31.310)
    - [8] o_6: VNodeParticles(n=1238, v=-24.548)
    - [9] o_7: VNodeParticles(n=986, v=-28.266)
    - [10] o_8: VNodeParticles(n=1150, v=-31.324)
    - [11] o_9: VNodeParticles(n=1386, v=-29.629)

Here we suppose we get o_3, the observation corresponding to the correct state:

(Pdb) dd[12][5]
o_3⟶_VNodePP(n=788, v=-56.256)
    - [0] a_0: QNode(n=1, v=-100.000)
    - [1] a_1: QNode(n=1, v=-100.000)
    - [2] a_10: QNode(n=1, v=-100.000)
    - [3] a_11: QNode(n=1, v=-100.000)
    - [4] a_2: QNode(n=1, v=-100.000)
    - [5] a_3: QNode(n=772, v=-56.256)
    - [6] a_4: QNode(n=1, v=-100.000)
    - [7] a_5: QNode(n=1, v=-100.000)
    - [8] a_6: QNode(n=1, v=-100.000)
    - [9] a_7: QNode(n=1, v=-100.000)
    - [10] a_8: QNode(n=1, v=-100.000)
    - [11] a_9: QNode(n=1, v=-100.000)
    - [12] a_wait: QNode(n=5, v=-62.203)

What surprises me is the high number of simulations going into a_3 here, given that the only thing that can happen is a transition to the terminal state:

(Pdb) dd[12][5][5]
a_3⟶_QNodePP(n=772, v=-56.256)
    - [0] o_term: VNodeParticles(n=771, v=0.000)

Meanwhile, there are only 5 simulations for the 'wait' action, which would collect more information and has several possible observations:

(Pdb) dd[12][5][12]
a_wait⟶_QNodePP(n=5, v=-62.203)
    - [0] o_3: VNodeParticles(n=0, v=0.000)
    - [1] o_4: VNodeParticles(n=0, v=0.000)
    - [2] o_5: VNodeParticles(n=0, v=0.000)
    - [3] o_6: VNodeParticles(n=0, v=0.000)
    - [4] o_8: VNodeParticles(n=0, v=0.000)

I don't think this behavior is normal; it did not change when I varied the exploration constant between the default value and 50. The terminal state is implemented as a state with id 'term' (i.e. s_term). Here are my transition and reward models:

class TDTransitionModel(TransitionModel):
    def __init__(self, n_targets, n_steps):
        super().__init__(n_targets)

        if not isinstance(n_steps, int):
            raise TypeError(f"Invalid number of steps: {n_steps}. It must be an integer.")
        self.n_steps = n_steps
        self.max_t = self.n_steps - 1  # To handle 0 indexing of states and time steps

    def probability(self, next_state, state, action):
        """Returns the probability p(s'|s, a)"""
        # If the current state is the terminal state, transition to itself with probability 1
        if "term" in state.name:
            if "term" in next_state.name:
                return 1.0
            else:
                return 0.0
        else:  # Not terminal state
            if state.t == self.max_t or "wait" not in action.name:  # Last time step or decision
                if "term" in next_state.name:  # Transition to terminal state
                    return 1
                else:
                    return 0
            else:  # Wait action on time steps other than the last one
                if next_state.t == state.t + 1:  # For the next time step
                    if next_state.id == state.id: 
                        return 1.0 - 1e-9
                    else:  # Other states in the next time step
                        return 1e-9
                else:  # Can't travel through time... yet
                    return 0

    def sample(self, state, action):
        """Randomly samples next state according to transition model"""
        # Always sample the terminal state if current state is terminal
        if "term" in state.name:
            return TDState("term", self.max_t)
        else:  # Not terminal state
            if state.t == self.max_t or "wait" not in action.name:  # Last time step or decision
                return TDState("term", 0)
            else:  # Wait action on time steps other than the last one
                next_step = state.t + 1
                return TDState(state.id, next_step)

    def get_all_states(self, t_step=None):
        """Returns a list of all states"""
        if t_step is not None:  # Get list of states for a given time_step
            all_states = [TDState(s, t_step) for s in range(self.n_targets)]
        else:  # All states, all time steps
            all_states = [TDState(s, d) for s, d in itertools.product(range(self.n_targets), range(self.n_steps))]
            all_states.append(TDState("term", 0))
        return all_states

class RewardModel(pomdp_py.RewardModel):
    def __init__(self, hit_reward=10, miss_cost=-100, wait_cost=-1):
        if not all(isinstance(attr, int) for attr in [hit_reward, miss_cost, wait_cost]):
            raise TypeError("All cost/reward values must be integers.")

        self.hit_reward = hit_reward
        self.miss_cost = miss_cost
        self.wait_cost = wait_cost

    def _reward_func(self, state, action):
        """
        The correct action is assumed to be the one that shares ID (i.e., number) with a given state,
        since we assume that each flicker is embedded in a button or actuator. Any action on the 
        terminal state gives a reward of 0. 
        """
        if 'term' in state.name:
            return 0
        else:
            if 'wait' in action.name:
                return self.wait_cost
            else:
                if action.id == state.id:  # HIT
                    return self.hit_reward
                else:  # MISS
                    return self.miss_cost

    def sample(self, state, action, next_state):
        """Deterministic reward"""
        return self._reward_func(state, action)

Here is also my observation_model in case it helps:

class TDObservationModel(ObservationModel):
    def __init__(self, features):
        self.observation_matrix = self._make_obs_matrix(features)
        self.n_steps, self.n_states, self.n_obs = self.observation_matrix.shape

    def probability(self, observation, next_state, action):
        if "term" in observation.name:  # Terminal observation
            if "term" in next_state.name or "wait" not in action.name:  # Transition to terminal state
                return 1
            else:
                return 0
        else:  # Non-terminal observation
            if "term" in next_state.name or "wait" not in action.name:
                return 0
            else:
                obs_idx = observation.id
                state_idx = next_state.id
                state_step = next_state.t - 1  # observation_matrix[0] corresponds to when next_state.t is 1
                return self.observation_matrix[state_step][state_idx][obs_idx]

    def sample(self, next_state, action):
        if "term" in next_state.name or "wait" not in action.name:  # Transition to terminal state
            return BCIObservation('term')
        else:  # Other transitions
            state_idx = next_state.id
            state_step = next_state.t - 1  # observation_matrix[0] corresponds to when next_state.t is 1
            obs_p = self.observation_matrix[state_step][state_idx]
            return np.random.choice(self.get_all_observations(include_terminal=False), p=obs_p)
zkytony commented 12 months ago

It seems like you have multiple terminal states, which is quite complex. Are you sure your models are implemented correctly? Bugs happen a lot when implementing a POMDP, and it can be very challenging to figure out what's wrong based on the output behavior alone. (It is even harder for me to do this for you.)

My reactions to a few things you said:

I suggest writing unit tests for the components of your POMDP. Tests where you simulate your POMDP based on manually chosen actions are valuable too.
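
For example, here is a minimal sketch of what a test for the transition model could look like (it reuses TDTransitionModel and TDState from the code above and assumes the base class sets self.n_targets; since the action class is not shown in this thread, a small stand-in with the same .name/.id attributes is used, and the 12 targets / 7 steps are just the values from your trace):

import itertools

class StubAction:
    """Stand-in for the real action class; only .name and .id are read by the models."""
    def __init__(self, name, id=None):
        self.name = name
        self.id = id

def test_transition_model_is_normalized():
    model = TDTransitionModel(n_targets=12, n_steps=7)
    actions = [StubAction("a_wait")] + [StubAction(f"a_{i}", i) for i in range(12)]
    all_states = model.get_all_states()
    for state, action in itertools.product(all_states, actions):
        total = sum(model.probability(next_state, state, action)
                    for next_state in all_states)
        # allow a little slack for the 1e-9 noise in the wait transitions
        assert abs(total - 1.0) < 1e-6, (state, action.name, total)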

Hororohoruru commented 12 months ago

Sorry, I noticed a bug where I created two different terminal states in the TransitionModel. There is only one terminal state. As far as the model implementation goes, I tested the probability() functions of the transition, observation and reward models with all combinations of s, s', a and o, and all the outputs were as expected.
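
To illustrate, reward-model checks of that kind can be as simple as this (just a sketch; the action objects are stand-ins with the same .name/.id interface as my real action class, which is not shown here):

from types import SimpleNamespace

def A(name, id=None):
    # stand-in action: the reward model only reads .name and .id
    return SimpleNamespace(name=name, id=id)

def test_reward_model_spot_checks():
    rm = RewardModel(hit_reward=10, miss_cost=-100, wait_cost=-1)
    s3, term = TDState(3, 0), TDState("term", 0)
    assert rm.sample(s3, A("a_3", 3), None) == 10      # hit: action id matches state id
    assert rm.sample(s3, A("a_5", 5), None) == -100    # miss
    assert rm.sample(s3, A("a_wait"), None) == -1      # waiting
    assert rm.sample(term, A("a_3", 3), None) == 0     # anything from the terminal state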

"I found that the POMCP planning tree always takes a decision after the first observation, with really low belief on the corresponding state." (I'm not sure what this means)

What I mean is that the belief for the tree starts uniformly distributed among the 12 possible states at the first time step. At this point, the highest valued action is wait (equivalent to Tiger's 'listen'). However, if we look down the tree via indexing, we see that after executing action 'wait' and receiving any observation (o_3 in the first post's example), the best action for the following step is to execute the action corresponding to the received observation. At that point in time, the belief for the state corresponding to that action is around 0.6.

"which in turn would collect more information" (that's not the criteria by which decisions are made)

What I meant is that, given that the belief after one observation is around 0.6, and given the reward for correct decisions (10) and the cost for incorrect ones (-100), the agent should not take a decision until the belief for a given state is between .85 and .98 in most cases. After one observation, the usual course of action is to select 'wait' several more times in order to receive more observations.
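
To make that explicit with the numbers from my RewardModel (this only compares immediate expected rewards and ignores the value of future observations, so the real threshold should sit even higher):

def expected_decision_reward(b, hit_reward=10, miss_cost=-100):
    """Expected immediate reward of committing to the most likely target at belief b."""
    return b * hit_reward + (1 - b) * miss_cost

print(expected_decision_reward(0.6))   # -34.0, much worse than the -1 cost of waiting
# Break-even belief for an immediate decision: 10*b - 100*(1 - b) = 0  =>  b = 100/110 ≈ 0.91,
# which is why I would expect the agent to keep waiting at a belief of 0.6.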

While I understand that testing this is not possible from your side, I would like to turn to the behavior of the tree after executing action 'wait' and receiving observation o_3 at the beginning (the belief for s_3 is now around .60):

(Pdb) dd[12][5]
o_3⟶_VNodePP(n=788, v=-56.256)
    - [0] a_0: QNode(n=1, v=-100.000)
    - [1] a_1: QNode(n=1, v=-100.000)
    - [2] a_10: QNode(n=1, v=-100.000)
    - [3] a_11: QNode(n=1, v=-100.000)
    - [4] a_2: QNode(n=1, v=-100.000)
    - [5] a_3: QNode(n=772, v=-56.256)
    - [6] a_4: QNode(n=1, v=-100.000)
    - [7] a_5: QNode(n=1, v=-100.000)
    - [8] a_6: QNode(n=1, v=-100.000)
    - [9] a_7: QNode(n=1, v=-100.000)
    - [10] a_8: QNode(n=1, v=-100.000)
    - [11] a_9: QNode(n=1, v=-100.000)
    - [12] a_wait: QNode(n=5, v=-62.203)

While I'm not surprised by the value of taking a_3 here, I'm surprised that the value for wait is lower, and I believe it has to do with the number of simulations, since a_3 transitions to the terminal state as shown by indexing the tree again:

(Pdb) dd[12][5][5]
a_3⟶_QNodePP(n=772, v=-56.256)
    - [0] o_term: VNodeParticles(n=771, v=0.000)

Meanwhile, only 5 simulations went to 'wait', and the resulting observation nodes have no simulations in them:

(Pdb) dd[12][5][12]
a_wait⟶_QNodePP(n=5, v=-62.203)
    - [0] o_3: VNodeParticles(n=0, v=0.000)
    - [1] o_4: VNodeParticles(n=0, v=0.000)
    - [2] o_5: VNodeParticles(n=0, v=0.000)
    - [3] o_6: VNodeParticles(n=0, v=0.000)
    - [4] o_8: VNodeParticles(n=0, v=0.000)

I will continue testing, but I would like to know whether you suspect anything that could cause the simulations to be so skewed towards a_3 in this case, with a belief as low as 0.6. As far as the implementation of the terminal state goes, I mainly want to know whether the elements I introduced (which I believe are correct, and will test further) are the right way to model it.

zkytony commented 12 months ago

I think what you’re observing may be a limitation of MCTS. Since the number of simulations is limited, it cannot provide accurate value estimates for every belief node, and it tends to exploit rather than explore enough (but this can be tuned, and indeed you seem to have tried).

What’s the planning depth? You’re getting 0 visits because either (1) the tree has reached its depth limit, or (2) the tree has not expanded there yet due to an insufficient number of simulations.
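
If you want to experiment with the budget, something along these lines is what I would try (only a sketch; `agent` stands in for your own pomdp_py.Agent, and the numbers are just illustrative):

import pomdp_py

def make_planner(agent, n_sims=50_000):
    # a more generous POMCP configuration (sketch); tune the numbers for your problem
    return pomdp_py.POMCP(
        max_depth=7,                        # one tree level per step of a trial
        discount_factor=0.95,               # use whatever discount you already use
        num_sims=n_sims,                    # or set planning_time=<seconds> instead
        exploration_const=200,              # comparable to the ~110-point reward spread
        rollout_policy=agent.policy_model)  # rollouts follow the agent's policy model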

Hororohoruru commented 12 months ago

I understand. However, does it make sense to dedicate 99% of the simulations to a path that has a deterministic outcome? Maybe my understanding of MCTS is limited, but I don't see why most simulations would go down a path whose transition is deterministic. I understand that it exploits the action with the best value, but I don't understand how 'wait' ends up valued lower than taking an immediate decision with such a low belief...

The planning depth is 7, so I believe the tree should have enough depth to expand that far...

Also, I reverted to the version before implementing the terminal state and this was not an issue, so I imagine the problem comes from that change.

Hororohoruru commented 12 months ago

Update! After playing around with longer planning times / higher numbers of simulations, I believe the problem does indeed come from the simulation budget. When the planner is initially left to plan for 10 seconds, the value estimates at the subsequent steps are much closer to what the belief would suggest.

Since such planning times are not feasible for the problem I'm working on, is it possible to run a long planning phase at the beginning of a POMDP run, save that initial tree, and initialize each trial with it? Trials are independent, but their initial belief is reset to be identical.

zkytony commented 12 months ago

I’m glad!

Yes, it should be possible. After the initial planning, the agent will have a “tree” attribute which represents the policy. The tricky part is if, during run time, you get observations that are not on any branch.
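
Roughly like this (only a sketch; `make_agent`, `make_planner` and `n_trials` are placeholders for your own setup, and I haven't checked whether the tree nodes can be deep-copied; if not, you may have to share the saved tree read-only or rebuild it):

import copy

# Plan once with a large budget and keep the resulting tree.
seed_agent = make_agent()                  # placeholder: builds the BCI agent
planner = make_planner(seed_agent)         # placeholder: e.g. POMCP with planning_time=10
planner.plan(seed_agent)                   # populates seed_agent.tree
saved_tree = seed_agent.tree

for trial in range(n_trials):
    agent = make_agent()                    # identical initial belief every trial
    agent.tree = copy.deepcopy(saved_tree)  # start each trial from the precomputed tree
    # ... per-step online planning as usual; after executing `action` and receiving
    # `observation`, planner.update(agent, action, observation) prunes the tree down
    # to the reached belief node, and the in-between-steps planning refines it.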

Hororohoruru commented 12 months ago

Cool! Can I just assign the tree to a variable and then create a new instance of the problem and assign the initial tree to the new agent?

The tricky part is if during run time you get observations not on any branch.

Indeed, that is something that can happen. However, my idea is to keep the online planning I do during the trial. It is not much, but I do have time to plan in between steps. Even if the number of simulations is dramatically lower, it should help fine-tune the tree after it is pruned. Hopefully that will avoid running into unexpected observations.

zkytony commented 10 months ago

Closing this now. Feel free to reopen if it comes up again.