Moving a problem from SARSOP to POMCP

Hororohoruru commented 1 year ago

Hello! I am currently trying to use pomdp_py to explore using POMDP as a framework to help decision-making in Brain-Computer Interfaces (BCI). A while back I asked about the possibility of defining a model that included a time component #21.

TL;DR - These are the problems/questions I have:

Particles is not constructed correctly when passed a list of tuples generated from the Histogram belief representation
I am not sure that the way I worked around the previous problem is correct
Even if it is, when I try to print the current belief (which should be a bunch of 0s, and only a handful of values other than 0), I get what I believe is the list of particles.
Technically not a question, but I also think this would be a useful piece of info to include in the Tiger problem's documentation
Also, thank you very much for this fantastic library. It is awesome and has made my life much easier in the last year.

After finishing a first work exploring BCIs using a simple POMDP that relied on SARSOP for solving. I am now trying to move to POMCP to solve a problem similar to what we discussed in #21.

I ran into a problem when trying to call pomcp.plan() in POMCP as I was initializing my belief like the following:

import pomdp_py
import itertools

n_targets = 12  # For example
n_steps = 8  # Number of time-steps before the true state changes

# Full state space
all_states = [TDState(state_id, t_step) for state_id, time_step 
                       in itertools.product(range(n_targets), range(n_steps))]
# States at time-step 0 only
all_init_states = [TDState(state_id, 0) for state_id in n_targets]

# Initialize belief with a uniform probability over n_targets for all states at time-step 0
init_belief = pomdp_py.Histogram({state: 1 / n_targets if state in all_init_states 
                                  else 0 for state in all_states})

I got then an error saying that the belief needs to be represented as particles in order to use POMCP. I notices this is not present in the Tiger's example in the documentation. After looking around for a bit, I found the Particles module. However, I do not understand how to construct it. I tried to do it in the following way (starting from the previous init_belief)

belief_dict = init_belief.get_histogram()
belief_tuples = [(value, weight) for value, weight in init_belief]

init_belief_particles = pomdp_py.Particles(belief_tuples)

But then I get the entire tuples (both State and its probability) as values, and all weights are set to None. I ended up making it like the following:

init_belief_particles = pomdp_py.Particles([]).from_histogram(init_belief)

That way it looks like it works. However, when I start the simulation I have the following piece of code where I print the current belief (right now as it was for the histogram representation of the belief):

for trial_n in range(n_trials):

    # Separate next trial and label
    next_trial = X_test[trial_n, :, :]
    next_label = int(y_test[trial_n])

    # Set belief to uniform, in case last trial ended without a decision
    vep_problem.agent.set_belief(init_belief_particles)

    # Set the true state as the env state (time step = 0)
    true_state = TDState(next_label, 0)
    vep_problem.env.apply_transition(true_state)

    print('')
    print(f'TRIAL {trial_n} (true state {true_state})')
    print('-' * 20)

    # For every time step...
    for step_n in range(n_steps):
        cur_belief = [vep_problem.agent.cur_belief[st] for st in vep_problem.agent.cur_belief]
        print('')
        print(f'  STEP {step_n}')
        print('  Current belief:')
        print(f'  {cur_belief}')

And then I get a very long list where none of the values is 0. At any time step, only n_targets states should be different than 0, as the initial belief is specified as stated previously and the transition model (not shown) specifies that all states can only go to the same state with an increase in time_step. That said, I don't know if what I am getting is the belief or rather the list of particles (my understanding of how the particle representation works is limited, even after reading the POMCP paper, so it may be that). If that is the case, I wonder if it is correct calling the belief with something like problem.agent.cur_belief.get_histogram().

Thank you very much for your time.

zkytony commented 1 year ago

Looks like the main issue is how to use Particles.

If you look at the file particles.pyx, there are two classes:

WeightedParticles

 cdef class WeightedParticles(GenerativeDistribution):
    """
    Represents a distribution :math:`\Pr(X)` with weighted particles, each is a
    tuple (value, weight). "value" means a value for the random variable X. If
    multiple values are present for the same value, will interpret the
    probability at X=x as the average of those weights.

    __init__(self, list particles, str approx_method="none", distance_func=None)

    Args:
       particles (list): List of (value, weight) tuples. The weight represents
           the likelihood that the value is drawn from the underlying distribution.
       approx_method (str): 'nearest' if when querying the probability
            of a value, and there is no matching particle for it, return
            the probability of the value closest to it. Assuming values
            are comparable; "none" if no approximation, return 0.
       distance_func: Used when approx_method is 'nearest'. Returns
           a number given two values in this particle set.
    """
    def __init__(self, list particles, str approx_method="none", distance_func=None):
        ...

The method you tried (constructing a list of (value, probability) tuples) actually suits this class.

Particles

cdef class Particles(WeightedParticles):
    """ Particles is a set of unweighted particles; This set of particles represent
    a distribution :math:`\Pr(X)`. Each particle takes on a specific value of :math:`X`.
    Inherits :py:mod:`~pomdp_py.representations.distribution.particles.WeightedParticles`.

    __init__(self, particles, **kwargs)

    Args:
        particles (list): List of values.
        kwargs: see __init__() of :py:mod:`~pomdp_py.representations.distribution.particles.WeightedParticles`.
    """
    def __init__(self, particles, **kwargs):

So Particles is a specific variant of WeightedParticles where all particles have equal weight. Therefore, it does not require you to specify the weight for each particle. That's why when you construct a Particles object in the way you tried, you get probability as part of a particle too.

The original paper of POMCP uses particles with the same weight (refer to the original paper). That's why pomdp_py requires the agent's belief to be a Particles object.

I got then an error saying that the belief needs to be represented as particles in order to use POMCP. I notices this is not present in the Tiger's example in the documentation.

Thanks for pointing this out. That's a mistake in the Tiger documentation. In the code tiger_problems.py, you can find this line that converts a histogram belief into a particles belief: https://github.com/h2r/pomdp-py/blob/master/pomdp_problems/tiger/tiger_problem.py#L328

Indeed, it would be nice to add a documentation regarding the different distributions in pomdp_py. This will be a todo.

Hororohoruru commented 1 year ago

Thank you very much for your response! I must have misread the documentation, I see. However, in my problem I am using a belief where only a handful of states are nonzero. Will I run into any problems for using a belief like this? Otherwise I will do it like in the tiger example.

Still, I have the issue that, when printing cur_belief, I get the list of particles (which is not very informative). Would it be accurate to print the current belief with cur_belief.condense()?

Lastly, I'd be glad to help with fixing the documentation and/or contributing with documentation about the distributions. I will have to modify other parts of the code down the line (like the value_function), so helping the documentation and writing examples could be a better way to understand the codebase better.

zkytony commented 1 year ago

Will I run into any problems for using a belief like this?

Well, I can't guarantee you to not run in to any problem. You should be able to use Particles to represent the belief state in your case.

Would it be accurate to print the current belief with cur_belief.condense()?

This may not give you an accurate summary of your distribution, because your belief is sparse and many states are probably not added as a particle. This function only produces an equivalent distribution that is more compact.

I'd say to be accurate, you have to do:

for s in ALL_STATES:
    # belief is Particles or WeightedParticles 
    print(s, belief[s])

Lastly, I'd be glad to help with fixing the documentation and/or contributing with documentation about the distributions.

Thanks. Your contribution is welcome. Please have a look at: https://github.com/h2r/pomdp-py/issues/25

Hororohoruru commented 1 year ago

Thank you very much. I'll proceed forward with the tips you have me. One last thing related to POMCP: Is there a way of running the planning step offline? I want to try planning once I get the relevant information from brain data and then running the model without running any more simulations. This is due to the POMDP running while new data is acquired and classified in real time to obtain real observations.

If not, another option would be to plan after every state change (i.e., an action other than 'wait' is taken), and instruct the experiment to stop for that interval of time (something like 0.5s).

zkytony commented 1 year ago

You could definitely do planning only once and reuse the plan. The result of POMCP planning is a tree, accessible via agent.tree. I encourage you to check out the tree debugging tool, which helps you understand how to access what's in the tree: https://h2r.github.io/pomdp-py/html/api/pomdp_py.utils.debugging.html

Hororohoruru commented 1 year ago

I will, thank you. So far I managed to make the model work, except for a small thing I am not sure I am doing right. As explained before, the initial belief is sparse at the beginning of a trial and uniform across all states that correspond to state.t == 0:

# Uniform initial belief (for states at time_step 0)
init_belief_hist = pomdp_py.Histogram({state: 1 / n_init_states if state in all_init_states else 0 for 
                                       state in all_states})
init_belief_particles = pomdp_py.Particles.from_histogram(init_belief_hist)

It looks something like this for the first trial:

TRIAL 0 (true state S3)
--------------------

  STEP 0
  Current belief:
        s_0-t_0 -> 0.09
        s_1-t_0 -> 0.087
        s_2-t_0 -> 0.07
        s_3-t_0 -> 0.098
        s_4-t_0 -> 0.083
        s_5-t_0 -> 0.115
        s_6-t_0 -> 0.079
        s_7-t_0 -> 0.073
        s_8-t_0 -> 0.062
        s_9-t_0 -> 0.068
        s_10-t_0 -> 0.102
        s_11-t_0 -> 0.073

I understand that the small differences in each belief are due to the particle representation and as the number of particles increase, the belief would approach uniformity. Anyway, given my transition model, any action other than wait (akin to opening doors in the Tiger problem), as well as wait on the last time step mean that the trial ends and the state of the world changes with a uniform probability to any state at state.t = 0:

def probability(self, next_state, state, action):
    """Returns the probability p(s'|s, a)"""
    if state.t == self.max_t or "wait" not in action.name:  # Last time step or decision
        if next_state.t == 0:
            return 1 / self.n_targets
        else:
            return 0
    else:  # Wait action on time steps other than the last one
        if next_state.t == state.t + 1:  # For the next time step
            if next_state.id == state.id: 
                return 1.0 - 1e-9
            else:  # Other states in the next time step
                return 1e-9
        else:  # Can't travel through time... yet
            return 0

def sample(self, state, action):
    """Randomly samples next state according to transition model"""
    if state.t == self.max_t or "wait" not in action.name:  # Last time step or decision
        all_init_states = self.get_all_states(t_step=0)
        return random.choice(all_init_states)
    else:  # Wait action on time steps other than the last one
        next_step = state.t + 1
        return TDState(state.id, next_step)

def get_all_states(self, t_step=None):
    """Returns a list of all states"""
    if t_step is not None:  # Get list of states for a given time_step
        all_states = [TDState(s, t_step) for s in range(self.n_targets)]
    else:  # All states, all time steps
        all_states = [TDState(s, d) for s, d in itertools.product(range(self.n_targets), range(self.n_steps))]
    return all_states

However, unless I mainly re-initialize my belief at the beginning of each trial (i.e. problem.agent.set_belief(init_belief_particles)), the initial belief for the next trial is heavily skewed towards one or more states. For example:

TRIAL 1 (true state S4)
--------------------

  STEP 0
  Current belief:
        s_0-t_0 -> 0.021
        s_1-t_0 -> 0.447
        s_2-t_0 -> 0.062
        s_3-t_0 -> 0.035
        s_4-t_0 -> 0.037
        s_5-t_0 -> 0.042
        s_6-t_0 -> 0.035
        s_7-t_0 -> 0.209
        s_8-t_0 -> 0.031
        s_9-t_0 -> 0.023
        s_10-t_0 -> 0.035
        s_11-t_0 -> 0.023

As you can see, where before the belief for each state went from ~7% to ~10%, now one state has close to 40% belief initially. This, combined with the fact that the observation noise is high at the earlier time steps, makes it so the correct state's belief can go to 0 by the time it is observed in consecutive time steps.

I suspect this may be due to the last update of each trial. In my POMDP, observations come from the brain data of a given participant, and at the end of each trial (which can be because an action different than wait was taken or because the max number of time steps was reached). The true state remains stable during a trial, but changes when the trial changes, and this change is driven by the user, so it does not depend on the observation obtained in the last step of the previous trial, is it always uniform.

The way the observation model works is by training a decoder beforehand that classifies the user's brain data, and some extra data is used to generate a confusion matrix of the classifier. This is what the observation model uses, and the probability of the observation matching the state is higher as time progresses (more data available thus better decoding). This is the code for the observation model:

class TDObservationModel(ObservationModel):
    """
    Parameters
    ----------

    Features: 3-D np.array, shape (n_steps, n_states, n_observations) 
        Feature array for the observation matrix. 

    Attributes
    ----------

    observation_matrix: 3D np.array, (n_timesteps, n_class, n_observation)
        Matrix representing the observation model, where each element represents the probability of obtaining
        the observation corresponding on the third dimension given that the agent is currently at the state 
        corresponding to the second simension and the current time step of the trial is that of the first dimension.

        Example:
            observation_matrix[3][2][5] = p(o=o_5|s=s_2, d=3)
    """
    def __init__(self, features):
        self.observation_matrix = features
        self.n_steps, self.n_states, self.n_obs = self.observation_matrix.shape

    def probability(self, observation, next_state, action):
        # Probability of obtaining a new observation given the next state and the time step
        obs_idx = observation.id
        state_idx = next_state.id
        state_step = next_state.t
        return self.observation_matrix[state_step][state_idx][obs_idx]

    def sample(self, next_state, action):
        # Return a random observation according to the probabilities given by the confusion matrix
        state_idx = next_state.id
        state_step = next_state.t
        obs_p = self.observation_matrix[state_step][state_idx]
        return np.random.choice(self.get_all_observations(), p=obs_p)

Should I make a special observation that is always returned by sample() whenever the trial changes? In case the problem is somewhere else, here is the main loop of my problem:

for trial_n in range(n_trials):
    # Separate next trial and label
    next_trial = X_test[trial_n, :, :]
    next_label = int(y_test[trial_n])

    # Set belief to uniform, in case last trial ended without a decision
    # vep_problem.agent.set_belief(init_belief_particles)

    print('')
    print(f'TRIAL {trial_n} (true state S{next_label})')
    print('-' * 20)

    # For every time step...
    for step_n in range(n_steps):
        # Set the true state as the env state
        true_state = TDState(next_label, step_n)
        vep_problem.env.apply_transition(true_state)

        print('')
        print(f'  STEP {step_n}')
        print('  Current belief:')
        cur_belief = vep_problem.agent.cur_belief
        print_belief(cur_belief, all_states, indent=8)

        # Get your action and execute it
        action = pomcp.plan(vep_problem.agent)
        reward = vep_problem.env.state_transition(action, execute=False)
        print('')
        print(f'  Action: {action.name}')
        print(f'  Reward: {reward}')

        # Add your reward
        total_reward += reward[-1]
        if reward[-1] == miss_cost:
            false_positives += 1

        # Predict and get observation
        t_start = 0
        t_end = time_steps[step_n]

        # Get prediction from the corresponding model
        sample_end = int(t_end * params['sfreq'])

        pred = int(best_model[step_n].predict(next_trial[..., :sample_end])[0])
        observation = BCIObservation(pred)
        print(f'  Observation: {observation.name}')

        # Belief update
        pomcp.update(vep_problem.agent, action, observation)

        # Go to next trial if action is taken
        if action.name != 'a_wait':
            decision_time = t_end
            total_time += decision_time
            beliefs.append((action, cur_belief))
            print('')
            print(f'Action {action.name} selected. Trial ended.')
            print(f'Decision took {decision_time}s')
            break

    # End of the trial. If the last action was 'wait', add a missed trial
    if action.name == 'a_wait':
        total_time += epoch_len
        misses += 1
        print(f'No decision was taken, trial stopped at {t_end}')
    print("End of the trial")

zkytony commented 1 year ago

I will try to look more into this.

Could you briefly answer a quick question: are trials independent? or is trial 0 supposed to influence trial 1?

Hororohoruru commented 1 year ago

Not at all, the true state of a trial has nothing to do on the state of the next trial.

zkytony commented 1 year ago

But it looks like vep_problem.agent's belief is not reset after a trial ends.

Hororohoruru commented 1 year ago

It is commented out at the beginning of the trial, yeah:

for trial_n in range(n_trials):
    # Separate next trial and label
    next_trial = X_test[trial_n, :, :]
    next_label = int(y_test[trial_n])

    # Set belief to uniform, in case last trial ended without a decision
    # vep_problem.agent.set_belief(init_belief_particles)

The thing is that, even if I force the belief to uniform, the simulation of the POMCP algorithm is not taking this belief update into account when planning or sampling. So I thought I should make this belief update happen when updating the model on the last step.

If I understood correctly, when you call POMCP.update(problem.agent, action, observation), you move the tree forward assuming both action and observation are real ones, and also you update the agent's belief. However, if the update of the tree does not naturally take the belief to uniform in the last time step, maybe there would be a mismatch between the tree and the belief?

I am not sure, my understanding of how the tree works is limited, I'm afraid. To support my point of view, the Tiger problem does not manually re-initialize the belief after opening a door, and the problem I am working on is very similar to that one. The only difference is that on the Tiger problem you can go on forever until you open a door, and here you can either 'open a door' or reach the last time step. In both cases the world transitions into new state that is independent from the previous one.

Maybe I should update my observation model to yield observations with uniform probability for whenever the trial ends?

zkytony commented 1 year ago

Belief update should happen at every time step (during a trial).

In both cases the world transitions into new state that is independent from the previous one.

I disagree. In Tiger, there is no state change no matter what action is taken.

the simulation of the POMCP algorithm is not taking this belief update into account when planning or sampling.

Could you elaborate on this? Are you saying POMCP is still planning with the old belief even after pomcp.update? That shouldn't happen.

However, if the update of the tree does not naturally take the belief to uniform in the last time step, maybe there would be a mismatch between the tree and the belief?

Suppose action a is taken. With POMCP, you could have a mismatch between the real observation and the observations that are children of the Q-node corresponding to action a. This causes particle deprivation. I am not sure if this is what you are talking about.

Hororohoruru commented 1 year ago

I disagree. In Tiger, there is no state change no matter what action is taken.

I see, I was under the impression that the state could change after opening a door. I think I mixed two things, allow me to reformulate:

In the Tiger problem, when you listen, you get an observation according to the noise parameter. In my problem, the listen equivalent is 'wait', and when you wait you get an observation based on a previously estimated confusion matrix. However, when you 'open' in the Tiger problem, you get an observation with uniform probability. In my problem, this is not implemented and I think this is where the problem comes from.

In more detail, once you take a decision in my problem (any action other than wait), obtaining an observation from the data in that time-step is a modelling error (I think). I believe in that case the observation should be sampled at random, similar to when you open in the tiger problem. Changing the observation model to the following:

    def probability(self, observation, next_state, action):
        # Probability of obtaining a new observation given the next state and the time step
        if next_state.t == self.n_steps or "wait" not in action.name:  # Last time step or decision
            return 1 / self.n_states

        else:  # Wait action on other time steps
            obs_idx = observation.id
            state_idx = next_state.id
            state_step = next_state.t
            return self.observation_matrix[state_step][state_idx][obs_idx]

    def sample(self, next_state, action):
        # Return a random observation according to the probabilities given by the confusion matrix
        if next_state.t == self.n_steps or "wait" not in action.name:  # Last time step or decision
            return np.random.choice(self.get_all_observations())

        else:  # Wait action on other time steps
            state_idx = next_state.id
            state_step = next_state.t
            obs_p = self.observation_matrix[state_step][state_idx]
            return np.random.choice(self.get_all_observations(), p=obs_p)

Fixed the issue of the skewed beliefs at the beginning of each trial. Do you think I should still manually re-initialize the belief because the trials are independent from each other?

Could you elaborate on this? Are you saying POMCP is still planning with the old belief even after pomcp.update? That shouldn't happen.

No, that is not what I meant. Again, I think the confusion comes from me mixing the belief with how the observation sampling is modelled at the last time step. After reading your comments, I realized that I am missing a time step in my code. Right now, my code does the following:

Initialize belief to uniform on time_step 0
Iterate from 0 to N (N = maximum length of brain data available)
- Plan and get an action
- Get an observation based on the data
- Update the tree and the belief with that observation
- If the action was not wait, break and go to the next trial

The problem with this is that, at time step N, the belief is updated with the maximum available data for that trial. After that, the trial is ended, where there should be a N+1 step to take an action with that belief, and then end the trial no matter the action. After this action is taken, there is no data to get an observation from, and it doesn't matter because the state is going to change at random. I believe I should randomly sample an observation and then update with the observation model I posted above.

Suppose action a is taken. With POMCP, you could have a mismatch between the real observation and the observations that are children of the Q-node corresponding to action a. This causes particle deprivation. I am not sure if this is what you are talking about.

Yes, this is what I was worried about. Since I was missing a last step in my model, the POMCP simulation assumes all observations come from the data and influence the next step. With the new observation model, the POMCP simulation knows that reaching the last time_step, the probability of obtaining any observation is uniform. I think there was no particle depravation in my case because I was running the model with the assumption that there is data available to get observations from at any time step. Now I am creating my model with an extra time step and everything seems to be working well.

Still, there would be any difference between manually setting the belief to uniform at the beginning of each trial?

zkytony commented 1 year ago

To ensure independence between trials, I would re-create a problem instance at the start of a trial.

Hororohoruru commented 1 year ago

When re-creating a problem instance at the end of each trial, is it necessary to also re-create the POMCP instance?

zkytony commented 1 year ago

It should be fine to use the same instance, since you would pass in the agent when you do pomcp.plan(agent). Unless the rollout policy used to create the POMCP instance is agent-specific.

It is not costly at all to create a POMCP instance. If you want to be safe you can create a new one.

Hororohoruru commented 1 year ago

Roger that, thank you very much! I will close the issue now.

h2r / pomdp-py

Moving a problem from SARSOP to POMCP #27