Closed Hororohoruru closed 1 year ago
Looks like the main issue is how to use Particles
.
If you look at the file particles.pyx, there are two classes:
WeightedParticles
cdef class WeightedParticles(GenerativeDistribution):
"""
Represents a distribution :math:`\Pr(X)` with weighted particles, each is a
tuple (value, weight). "value" means a value for the random variable X. If
multiple values are present for the same value, will interpret the
probability at X=x as the average of those weights.
__init__(self, list particles, str approx_method="none", distance_func=None)
Args:
particles (list): List of (value, weight) tuples. The weight represents
the likelihood that the value is drawn from the underlying distribution.
approx_method (str): 'nearest' if when querying the probability
of a value, and there is no matching particle for it, return
the probability of the value closest to it. Assuming values
are comparable; "none" if no approximation, return 0.
distance_func: Used when approx_method is 'nearest'. Returns
a number given two values in this particle set.
"""
def __init__(self, list particles, str approx_method="none", distance_func=None):
...
The method you tried (constructing a list of (value, probability) tuples) actually suits this class.
Particles
cdef class Particles(WeightedParticles):
""" Particles is a set of unweighted particles; This set of particles represent
a distribution :math:`\Pr(X)`. Each particle takes on a specific value of :math:`X`.
Inherits :py:mod:`~pomdp_py.representations.distribution.particles.WeightedParticles`.
__init__(self, particles, **kwargs)
Args:
particles (list): List of values.
kwargs: see __init__() of :py:mod:`~pomdp_py.representations.distribution.particles.WeightedParticles`.
"""
def __init__(self, particles, **kwargs):
So Particles is a specific variant of WeightedParticles where all particles have equal weight. Therefore, it does not require you to specify the weight for each particle. That's why when you construct a Particles object in the way you tried, you get probability as part of a particle too.
The original paper of POMCP uses particles with the same weight (refer to the original paper). That's why pomdp_py requires the agent's belief to be a Particles
object.
I got then an error saying that the belief needs to be represented as particles in order to use POMCP. I notices this is not present in the Tiger's example in the documentation.
Thanks for pointing this out. That's a mistake in the Tiger documentation. In the code tiger_problems.py, you can find this line that converts a histogram belief into a particles belief: https://github.com/h2r/pomdp-py/blob/master/pomdp_problems/tiger/tiger_problem.py#L328
Indeed, it would be nice to add a documentation regarding the different distributions in pomdp_py. This will be a todo.
Thank you very much for your response! I must have misread the documentation, I see. However, in my problem I am using a belief where only a handful of states are nonzero. Will I run into any problems for using a belief like this? Otherwise I will do it like in the tiger example.
Still, I have the issue that, when printing cur_belief
, I get the list of particles (which is not very informative). Would it be accurate to print the current belief with cur_belief.condense()
?
Lastly, I'd be glad to help with fixing the documentation and/or contributing with documentation about the distributions. I will have to modify other parts of the code down the line (like the value_function), so helping the documentation and writing examples could be a better way to understand the codebase better.
Will I run into any problems for using a belief like this?
Well, I can't guarantee you to not run in to any problem. You should be able to use Particles to represent the belief state in your case.
Would it be accurate to print the current belief with
cur_belief.condense()
?
This may not give you an accurate summary of your distribution, because your belief is sparse and many states are probably not added as a particle. This function only produces an equivalent distribution that is more compact.
I'd say to be accurate, you have to do:
for s in ALL_STATES:
# belief is Particles or WeightedParticles
print(s, belief[s])
Lastly, I'd be glad to help with fixing the documentation and/or contributing with documentation about the distributions.
Thanks. Your contribution is welcome. Please have a look at: https://github.com/h2r/pomdp-py/issues/25
Thank you very much. I'll proceed forward with the tips you have me. One last thing related to POMCP: Is there a way of running the planning step offline? I want to try planning once I get the relevant information from brain data and then running the model without running any more simulations. This is due to the POMDP running while new data is acquired and classified in real time to obtain real observations.
If not, another option would be to plan after every state change (i.e., an action other than 'wait' is taken), and instruct the experiment to stop for that interval of time (something like 0.5s).
You could definitely do planning only once and reuse the plan. The result of POMCP planning is a tree, accessible via agent.tree. I encourage you to check out the tree debugging tool, which helps you understand how to access what's in the tree: https://h2r.github.io/pomdp-py/html/api/pomdp_py.utils.debugging.html
I will, thank you. So far I managed to make the model work, except for a small thing I am not sure I am doing right. As explained before, the initial belief is sparse at the beginning of a trial and uniform across all states that correspond to state.t == 0
:
# Uniform initial belief (for states at time_step 0)
init_belief_hist = pomdp_py.Histogram({state: 1 / n_init_states if state in all_init_states else 0 for
state in all_states})
init_belief_particles = pomdp_py.Particles.from_histogram(init_belief_hist)
It looks something like this for the first trial:
TRIAL 0 (true state S3)
--------------------
STEP 0
Current belief:
s_0-t_0 -> 0.09
s_1-t_0 -> 0.087
s_2-t_0 -> 0.07
s_3-t_0 -> 0.098
s_4-t_0 -> 0.083
s_5-t_0 -> 0.115
s_6-t_0 -> 0.079
s_7-t_0 -> 0.073
s_8-t_0 -> 0.062
s_9-t_0 -> 0.068
s_10-t_0 -> 0.102
s_11-t_0 -> 0.073
I understand that the small differences in each belief are due to the particle representation and as the number of particles increase, the belief would approach uniformity. Anyway, given my transition model, any action other than wait
(akin to opening doors in the Tiger problem), as well as wait
on the last time step mean that the trial ends and the state of the world changes with a uniform probability to any state at state.t = 0
:
def probability(self, next_state, state, action):
"""Returns the probability p(s'|s, a)"""
if state.t == self.max_t or "wait" not in action.name: # Last time step or decision
if next_state.t == 0:
return 1 / self.n_targets
else:
return 0
else: # Wait action on time steps other than the last one
if next_state.t == state.t + 1: # For the next time step
if next_state.id == state.id:
return 1.0 - 1e-9
else: # Other states in the next time step
return 1e-9
else: # Can't travel through time... yet
return 0
def sample(self, state, action):
"""Randomly samples next state according to transition model"""
if state.t == self.max_t or "wait" not in action.name: # Last time step or decision
all_init_states = self.get_all_states(t_step=0)
return random.choice(all_init_states)
else: # Wait action on time steps other than the last one
next_step = state.t + 1
return TDState(state.id, next_step)
def get_all_states(self, t_step=None):
"""Returns a list of all states"""
if t_step is not None: # Get list of states for a given time_step
all_states = [TDState(s, t_step) for s in range(self.n_targets)]
else: # All states, all time steps
all_states = [TDState(s, d) for s, d in itertools.product(range(self.n_targets), range(self.n_steps))]
return all_states
However, unless I mainly re-initialize my belief at the beginning of each trial (i.e. problem.agent.set_belief(init_belief_particles)
), the initial belief for the next trial is heavily skewed towards one or more states. For example:
TRIAL 1 (true state S4)
--------------------
STEP 0
Current belief:
s_0-t_0 -> 0.021
s_1-t_0 -> 0.447
s_2-t_0 -> 0.062
s_3-t_0 -> 0.035
s_4-t_0 -> 0.037
s_5-t_0 -> 0.042
s_6-t_0 -> 0.035
s_7-t_0 -> 0.209
s_8-t_0 -> 0.031
s_9-t_0 -> 0.023
s_10-t_0 -> 0.035
s_11-t_0 -> 0.023
As you can see, where before the belief for each state went from ~7% to ~10%, now one state has close to 40% belief initially. This, combined with the fact that the observation noise is high at the earlier time steps, makes it so the correct state's belief can go to 0 by the time it is observed in consecutive time steps.
I suspect this may be due to the last update of each trial. In my POMDP, observations come from the brain data of a given participant, and at the end of each trial (which can be because an action different than wait
was taken or because the max number of time steps was reached). The true state remains stable during a trial, but changes when the trial changes, and this change is driven by the user, so it does not depend on the observation obtained in the last step of the previous trial, is it always uniform.
The way the observation model works is by training a decoder beforehand that classifies the user's brain data, and some extra data is used to generate a confusion matrix of the classifier. This is what the observation model uses, and the probability of the observation matching the state is higher as time progresses (more data available thus better decoding). This is the code for the observation model:
class TDObservationModel(ObservationModel):
"""
Parameters
----------
Features: 3-D np.array, shape (n_steps, n_states, n_observations)
Feature array for the observation matrix.
Attributes
----------
observation_matrix: 3D np.array, (n_timesteps, n_class, n_observation)
Matrix representing the observation model, where each element represents the probability of obtaining
the observation corresponding on the third dimension given that the agent is currently at the state
corresponding to the second simension and the current time step of the trial is that of the first dimension.
Example:
observation_matrix[3][2][5] = p(o=o_5|s=s_2, d=3)
"""
def __init__(self, features):
self.observation_matrix = features
self.n_steps, self.n_states, self.n_obs = self.observation_matrix.shape
def probability(self, observation, next_state, action):
# Probability of obtaining a new observation given the next state and the time step
obs_idx = observation.id
state_idx = next_state.id
state_step = next_state.t
return self.observation_matrix[state_step][state_idx][obs_idx]
def sample(self, next_state, action):
# Return a random observation according to the probabilities given by the confusion matrix
state_idx = next_state.id
state_step = next_state.t
obs_p = self.observation_matrix[state_step][state_idx]
return np.random.choice(self.get_all_observations(), p=obs_p)
Should I make a special observation that is always returned by sample()
whenever the trial changes? In case the problem is somewhere else, here is the main loop of my problem:
for trial_n in range(n_trials):
# Separate next trial and label
next_trial = X_test[trial_n, :, :]
next_label = int(y_test[trial_n])
# Set belief to uniform, in case last trial ended without a decision
# vep_problem.agent.set_belief(init_belief_particles)
print('')
print(f'TRIAL {trial_n} (true state S{next_label})')
print('-' * 20)
# For every time step...
for step_n in range(n_steps):
# Set the true state as the env state
true_state = TDState(next_label, step_n)
vep_problem.env.apply_transition(true_state)
print('')
print(f' STEP {step_n}')
print(' Current belief:')
cur_belief = vep_problem.agent.cur_belief
print_belief(cur_belief, all_states, indent=8)
# Get your action and execute it
action = pomcp.plan(vep_problem.agent)
reward = vep_problem.env.state_transition(action, execute=False)
print('')
print(f' Action: {action.name}')
print(f' Reward: {reward}')
# Add your reward
total_reward += reward[-1]
if reward[-1] == miss_cost:
false_positives += 1
# Predict and get observation
t_start = 0
t_end = time_steps[step_n]
# Get prediction from the corresponding model
sample_end = int(t_end * params['sfreq'])
pred = int(best_model[step_n].predict(next_trial[..., :sample_end])[0])
observation = BCIObservation(pred)
print(f' Observation: {observation.name}')
# Belief update
pomcp.update(vep_problem.agent, action, observation)
# Go to next trial if action is taken
if action.name != 'a_wait':
decision_time = t_end
total_time += decision_time
beliefs.append((action, cur_belief))
print('')
print(f'Action {action.name} selected. Trial ended.')
print(f'Decision took {decision_time}s')
break
# End of the trial. If the last action was 'wait', add a missed trial
if action.name == 'a_wait':
total_time += epoch_len
misses += 1
print(f'No decision was taken, trial stopped at {t_end}')
print("End of the trial")
I will try to look more into this.
Could you briefly answer a quick question: are trials independent? or is trial 0 supposed to influence trial 1?
Not at all, the true state of a trial has nothing to do on the state of the next trial.
But it looks like vep_problem.agent
's belief is not reset after a trial ends.
It is commented out at the beginning of the trial, yeah:
for trial_n in range(n_trials):
# Separate next trial and label
next_trial = X_test[trial_n, :, :]
next_label = int(y_test[trial_n])
# Set belief to uniform, in case last trial ended without a decision
# vep_problem.agent.set_belief(init_belief_particles)
The thing is that, even if I force the belief to uniform, the simulation of the POMCP algorithm is not taking this belief update into account when planning or sampling. So I thought I should make this belief update happen when updating the model on the last step.
If I understood correctly, when you call POMCP.update(problem.agent, action, observation)
, you move the tree forward assuming both action and observation are real ones, and also you update the agent's belief. However, if the update of the tree does not naturally take the belief to uniform in the last time step, maybe there would be a mismatch between the tree and the belief?
I am not sure, my understanding of how the tree works is limited, I'm afraid. To support my point of view, the Tiger problem does not manually re-initialize the belief after opening a door, and the problem I am working on is very similar to that one. The only difference is that on the Tiger problem you can go on forever until you open a door, and here you can either 'open a door' or reach the last time step. In both cases the world transitions into new state that is independent from the previous one.
Maybe I should update my observation model to yield observations with uniform probability for whenever the trial ends?
Belief update should happen at every time step (during a trial).
In both cases the world transitions into new state that is independent from the previous one.
I disagree. In Tiger, there is no state change no matter what action is taken.
the simulation of the POMCP algorithm is not taking this belief update into account when planning or sampling.
Could you elaborate on this? Are you saying POMCP is still planning with the old belief even after pomcp.update
? That shouldn't happen.
However, if the update of the tree does not naturally take the belief to uniform in the last time step, maybe there would be a mismatch between the tree and the belief?
Suppose action a
is taken. With POMCP, you could have a mismatch between the real observation and the observations that are children of the Q-node corresponding to action a
. This causes particle deprivation. I am not sure if this is what you are talking about.
I disagree. In Tiger, there is no state change no matter what action is taken.
I see, I was under the impression that the state could change after opening a door. I think I mixed two things, allow me to reformulate:
In the Tiger problem, when you listen, you get an observation according to the noise parameter. In my problem, the listen equivalent is 'wait', and when you wait you get an observation based on a previously estimated confusion matrix. However, when you 'open' in the Tiger problem, you get an observation with uniform probability. In my problem, this is not implemented and I think this is where the problem comes from.
In more detail, once you take a decision in my problem (any action other than wait), obtaining an observation from the data in that time-step is a modelling error (I think). I believe in that case the observation should be sampled at random, similar to when you open in the tiger problem. Changing the observation model to the following:
def probability(self, observation, next_state, action):
# Probability of obtaining a new observation given the next state and the time step
if next_state.t == self.n_steps or "wait" not in action.name: # Last time step or decision
return 1 / self.n_states
else: # Wait action on other time steps
obs_idx = observation.id
state_idx = next_state.id
state_step = next_state.t
return self.observation_matrix[state_step][state_idx][obs_idx]
def sample(self, next_state, action):
# Return a random observation according to the probabilities given by the confusion matrix
if next_state.t == self.n_steps or "wait" not in action.name: # Last time step or decision
return np.random.choice(self.get_all_observations())
else: # Wait action on other time steps
state_idx = next_state.id
state_step = next_state.t
obs_p = self.observation_matrix[state_step][state_idx]
return np.random.choice(self.get_all_observations(), p=obs_p)
Fixed the issue of the skewed beliefs at the beginning of each trial. Do you think I should still manually re-initialize the belief because the trials are independent from each other?
Could you elaborate on this? Are you saying POMCP is still planning with the old belief even after
pomcp.update
? That shouldn't happen.
No, that is not what I meant. Again, I think the confusion comes from me mixing the belief with how the observation sampling is modelled at the last time step. After reading your comments, I realized that I am missing a time step in my code. Right now, my code does the following:
The problem with this is that, at time step N, the belief is updated with the maximum available data for that trial. After that, the trial is ended, where there should be a N+1 step to take an action with that belief, and then end the trial no matter the action. After this action is taken, there is no data to get an observation from, and it doesn't matter because the state is going to change at random. I believe I should randomly sample an observation and then update with the observation model I posted above.
Suppose action
a
is taken. With POMCP, you could have a mismatch between the real observation and the observations that are children of the Q-node corresponding to actiona
. This causes particle deprivation. I am not sure if this is what you are talking about.
Yes, this is what I was worried about. Since I was missing a last step in my model, the POMCP simulation assumes all observations come from the data and influence the next step. With the new observation model, the POMCP simulation knows that reaching the last time_step, the probability of obtaining any observation is uniform. I think there was no particle depravation in my case because I was running the model with the assumption that there is data available to get observations from at any time step. Now I am creating my model with an extra time step and everything seems to be working well.
Still, there would be any difference between manually setting the belief to uniform at the beginning of each trial?
To ensure independence between trials, I would re-create a problem instance at the start of a trial.
When re-creating a problem instance at the end of each trial, is it necessary to also re-create the POMCP instance?
It should be fine to use the same instance, since you would pass in the agent when you do pomcp.plan(agent). Unless the rollout policy used to create the POMCP instance is agent-specific.
It is not costly at all to create a POMCP instance. If you want to be safe you can create a new one.
Roger that, thank you very much! I will close the issue now.
Hello! I am currently trying to use pomdp_py to explore using POMDP as a framework to help decision-making in Brain-Computer Interfaces (BCI). A while back I asked about the possibility of defining a model that included a time component #21.
TL;DR - These are the problems/questions I have:
Particles
is not constructed correctly when passed a list of tuples generated from theHistogram
belief representationAfter finishing a first work exploring BCIs using a simple POMDP that relied on SARSOP for solving. I am now trying to move to POMCP to solve a problem similar to what we discussed in #21.
I ran into a problem when trying to call
pomcp.plan()
in POMCP as I was initializing my belief like the following:I got then an error saying that the belief needs to be represented as particles in order to use POMCP. I notices this is not present in the Tiger's example in the documentation. After looking around for a bit, I found the
Particles
module. However, I do not understand how to construct it. I tried to do it in the following way (starting from the previousinit_belief
)But then I get the entire tuples (both State and its probability) as values, and all weights are set to
None
. I ended up making it like the following:That way it looks like it works. However, when I start the simulation I have the following piece of code where I print the current belief (right now as it was for the histogram representation of the belief):
And then I get a very long list where none of the values is 0. At any time step, only
n_targets
states should be different than 0, as the initial belief is specified as stated previously and the transition model (not shown) specifies that all states can only go to the same state with an increase in time_step. That said, I don't know if what I am getting is the belief or rather the list of particles (my understanding of how the particle representation works is limited, even after reading the POMCP paper, so it may be that). If that is the case, I wonder if it is correct calling the belief with something likeproblem.agent.cur_belief.get_histogram()
.Thank you very much for your time.