A modification of the 2AFC decision-making task to be more amenable to analysis:
only two actions (-1 or 1), and only two hidden states (-1 or 1).
Here, the state changes every 10 time steps (to either -1 or 1), actions have no impact on state transitions (this helps fix the data distribution), and the reward is 0 at all time steps except every 10th time step, where it indicates whether or not the action (-1 or 1) matched the state over the last 10 time steps.
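A minimal sketch of this environment in Python (the class name, the Gaussian observation model, and the rule of redrawing the state at each block boundary are all assumptions; the note only fixes the action/state sets, the 10-step blocks, and the reward timing):

```python
import numpy as np

class TwoAFCEnv:
    """Sketch of the modified 2AFC task: hidden state in {-1, 1}, redrawn
    every block_len steps independently of the action; reward is delivered
    only on the last step of each block."""

    def __init__(self, block_len=10, obs_noise=1.0, seed=0):
        self.block_len = block_len
        self.obs_noise = obs_noise  # assumed Gaussian observation noise (not specified in the note)
        self.rng = np.random.default_rng(seed)
        self.state = self.rng.choice([-1, 1])
        self.t = 0

    def step(self, action):
        self.t += 1
        reward = 0.0
        if self.t % self.block_len == 0:
            # reward signals whether the chosen action matched the hidden state
            reward = 1.0 if action == self.state else 0.0
            # state is redrawn at the block boundary; the action has no effect on this
            self.state = self.rng.choice([-1, 1])
        # noisy observation of the hidden state (assumed; the note leaves the observation model open)
        obs = self.state + self.obs_noise * self.rng.normal()
        return obs, reward
```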
In this case, I think you could define an optimal policy directly in terms of beliefs. E.g., if the belief b(t) is 1D, then your action is a(t) = argmax([b(t), 1 - b(t)]).
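A sketch of that belief-based policy, assuming b(t) is the posterior probability that the hidden state is 1 and the Gaussian observation model from the environment sketch above (both assumptions, not stated in the note):

```python
import numpy as np

def update_belief(b_prev, obs, obs_noise=1.0):
    # Bayesian update of b(t) = P(state = 1 | observations so far), under the
    # assumed Gaussian observation model centered on the hidden state.
    like_pos = np.exp(-0.5 * ((obs - 1.0) / obs_noise) ** 2)
    like_neg = np.exp(-0.5 * ((obs + 1.0) / obs_noise) ** 2)
    post_pos = like_pos * b_prev
    return post_pos / (post_pos + like_neg * (1.0 - b_prev))

def optimal_action(b_t):
    # a(t) = argmax([b(t), 1 - b(t)]): index 0 maps to action 1, index 1 to action -1
    return 1 if np.argmax([b_t, 1.0 - b_t]) == 0 else -1
```

Under these assumptions you would start each block from b = 0.5, update the belief with every observation, and reset it at each block boundary, since the state is redrawn there.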