Closed: Hororohoruru closed this issue 1 year ago
> I had a look and saw that all trials have a much higher count for particle reinvigoration at the last time step compared to others
Having high particle reinvigoration could be totally normal. It's a different issue from particle deprivation. If the number keeps growing, that suggests the true observation is somehow less predictable from the current particles. That might be a property of the problem itself.
> Since trials are independent (i.e., I am creating a new instance of the problem for every trial and resetting the belief), I wonder if it makes sense to update the belief when the trial ends at all.
Since trials are independent, you only need to make sure the belief is correct at the start of a trial.
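To make this concrete, here is a minimal sketch of what "making sure the belief is correct at the start of a trial" could look like: re-drawing the particle set from the uniform initial distribution that the question describes. The flat integer state representation, `N_STATES`, and `reset_belief` are illustrative assumptions, not the actual problem code.

```python
import random

# Illustrative assumptions: states are indices 0..N_STATES-1 and the belief
# is a plain list of particle states (not the actual problem code).
N_STATES = 4
N_PARTICLES = 1000

def reset_belief(n_particles=N_PARTICLES, n_states=N_STATES, rng=random):
    """Return a fresh particle set drawn uniformly over states, matching the
    uniform transition into a new trial described in the question."""
    return [rng.randrange(n_states) for _ in range(n_particles)]

belief = reset_belief()
# All particles are valid states and every state is represented, so the
# belief at the start of the trial matches the uniform initial distribution.
```

Skipping the final belief update and calling something like this instead sidesteps the deprivation at the last time step entirely, since the end-of-trial observation is never filtered against the particle set.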
> That might be a property of the problem itself
If that were the case, what can I do to avoid this occasional particle deprivation? I can circumvent it by just avoiding the final update, but I am curious about potential solutions.
> Since trials are independent, you only need to make sure the belief is correct at the start of a trial.
Great, I will do that then, thank you!
Hello! While running a POMCP problem, I'm sometimes getting particle deprivation at the end of a trial. It is the same problem as in #32 and #27, so it is a finite-horizon problem that has N states and observations, with N+1 possible actions (i.e. one action per state and the 'wait' action).
Since real observations are coming from data, the observation model is constructed by predicting from this data with a decoder to obtain $p(o|s')$ (the action does not change the observation function as of now). The trial ends when any action other than 'wait' is taken, or when the last time step of a trial is reached without any action. When this happens, a new trial starts, and the transition probability is uniform over all possible states.
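For reference, an observation model of this shape could be sketched as below. The decoder itself is stubbed out with a made-up lookup table (`decoder_probs`), and the class name and `N` are hypothetical; the only point carried over from the setup above is that $p(o|s')$ comes from decoder predictions and ignores the action.

```python
import random

# Hypothetical stand-in for the real decoder: decoder_probs[s'] is the
# predicted distribution p(o | s'). The numbers here are made up.
N = 3
decoder_probs = {
    0: [0.8, 0.1, 0.1],
    1: [0.1, 0.8, 0.1],
    2: [0.1, 0.1, 0.8],
}

class DecoderObservationModel:
    """p(o | s') taken from decoder predictions; independent of the action."""

    def probability(self, observation, next_state, action=None):
        # Look up the decoder's predicted probability of this observation.
        return decoder_probs[next_state][observation]

    def sample(self, next_state, action=None, rng=random):
        # Draw an observation according to the decoder's distribution.
        return rng.choices(range(N), weights=decoder_probs[next_state])[0]

model = DecoderObservationModel()
```

Each row of `decoder_probs` sums to 1, so `probability` and `sample` are consistent with each other, which is what the planner's simulated observations rely on.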
I noticed that I sometimes get particle deprivation right after the trial ends. After looking at the code, I saw that particle deprivation can happen if there is an observation that was never anticipated in the tree. I had a look and saw that all trials have a much higher count for particle reinvigoration at the last time step compared to other time steps (n_particles = 1000), for example:
This makes sense to me, as the code is sampling an observation at random instead of getting it from data when the trial is going to end. Given this, I can see why it usually gets an observation that has not been simulated much in the tree. I get particle deprivation sometimes, so I guess it depends on how the simulation goes and which observations are sampled.
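The mechanism described here can be illustrated with a toy POMCP-style belief update (this is not the library's actual code): successor particles are kept only if their simulated observation matches the real one, so an observation that almost no particle predicts leaves the filter empty and forces reinvigoration. The generative model `G` below is an invented toy assumption.

```python
import random

# Toy illustration of particle deprivation (not the library's implementation).
N_STATES = 3

def G(state, action, rng):
    """Toy black-box simulator: the state persists; the observation equals
    the state 90% of the time and is uniform noise otherwise."""
    next_state = state
    obs = next_state if rng.random() < 0.9 else rng.randrange(N_STATES)
    return next_state, obs

def update_belief(particles, action, real_obs, rng=random):
    """Keep successor particles whose simulated observation matches real_obs."""
    new_particles = []
    for s in particles:
        s_next, o = G(s, action, rng)
        if o == real_obs:
            new_particles.append(s_next)
    if not new_particles:
        # Particle deprivation: no particle predicted real_obs.
        # Reinvigoration injects fresh (here, uniformly drawn) particles.
        new_particles = [rng.randrange(N_STATES) for _ in range(len(particles))]
    return new_particles
```

With `real_obs` drawn from the same model as the particles, most particles survive; with an observation the particles essentially never predict, the first pass comes back empty and the reinvigoration branch fires, which matches the high reinvigoration counts seen at the last time step.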
Since trials are independent (i.e., I am creating a new instance of the problem for every trial and resetting the belief), I wonder if it makes sense to update the belief when the trial ends at all. Not doing it would avoid this issue with particle deprivation, if I understood correctly. Even so, I am wondering whether this method of providing observations during the last time step is correct. To clarify, the observation model currently looks like this: