Okay, finally got a model training. There is now also a null action, with a penalty of r=-1 for any non-null action before the go cue. For this model, aprev/rprev are only nonzero on the time step directly following the decision time (i.e., the last time step of the trial). Though it's possible this isn't necessary once the model learns not to act until the go cue.
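Roughly, the per-step reward now looks like the sketch below (the action coding and names are my shorthand, and `rewarded_port` stands in for whichever port the p=0.8 contingency pays out on this trial; not the actual code):

```python
def step_reward(action, go_cue, rewarded_port):
    """Reward for a single time step (sketch only).
    action: 0 = null, 1 = left port, 2 = right port (hypothetical coding)."""
    if not go_cue:
        # any non-null action before the go cue is penalized
        return -1.0 if action != 0 else 0.0
    # at the go cue, reward depends on whether the chosen port is the one
    # currently paying out
    return 1.0 if action == rewarded_port else 0.0
```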
In any case, the plot above shows Zpc (H=3) and Q. All time steps are black dots. Red trajectories are trials when S=0 and blue trajectories are trials when S=1. Lines are solid for rewarded actions and dashed for unrewarded ones. The trials shown are from a segment when the state switched from blue to red and then back again. Because p=0.8, the transition is gradual.
Technically, we only have one fixed-point region (the red stars, hard to see). But roughly, the line attractors at the top of Zpc are for actions, and we get there only when we see the go cue; the line attractors at the bottom seem to be the latent belief space.
Okay, so the timestep-level model is now definitely working, alongside an abort penalty for acting before the ITI is over. No "censor" required, or special encoding. In other words, the model's inputs are $[r_{t-1},~ a_{t-1},~ o_t]$ for every time step $t$, where $o_t$ is a binary indicator of whether or not the agent should make its decision. (There are now three actions: no response, left port, and right port.) I'm also using TD(λ=0.2) just for fun.
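For reference, TD(λ) just adds an eligibility trace to the usual TD update, with λ controlling how far back credit spreads. A minimal sketch with a linear value function (the actual model is a recurrent network, and everything here other than λ=0.2 is made up for illustration):

```python
import numpy as np

def td_lambda_step(w, e, x, x_next, r, alpha=0.1, gamma=0.95, lam=0.2):
    """One TD(lambda) update for a linear value function v(x) = w @ x."""
    delta = r + gamma * (w @ x_next) - (w @ x)   # one-step TD error
    e = gamma * lam * e + x                      # eligibility trace, decays by gamma*lambda
    w = w + alpha * delta * e                    # credit spread over recent time steps
    return w, e
```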
If you model the agent's decisions, we still see "perseveration," in that the weight on previous actions is above zero. We also see a smaller dependence on reward omissions than on reward presence, which was also present in the behavioral data but not reported.
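Something like the sketch below is what I mean by "modeling the agent's decisions": a lagged logistic regression of the current choice on past choices and past choice × reward terms. The exact lag structure and regressors here are just an illustrative guess (a Lau & Glimcher-style design), not necessarily the analysis as run:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_choice_regression(choices, rewards, n_lags=3):
    """Regress choice_t on past choices and past rewarded choices.
    choices, rewards: arrays of 0/1 per trial."""
    c = 2 * np.asarray(choices) - 1          # choices coded -1/+1
    cr = c * np.asarray(rewards)             # rewarded choices
    n = len(c)
    cols = []
    for k in range(1, n_lags + 1):
        cols.append(c[n_lags - k:n - k])     # choice at lag k (perseveration weights)
        cols.append(cr[n_lags - k:n - k])    # choice x reward at lag k
    X = np.column_stack(cols)
    y = (c[n_lags:] > 0).astype(int)         # current choice as 0/1
    # an unrewarded past choice contributes via the choice weight alone, while a
    # rewarded one adds the choice-x-reward weight, so the reward/omission
    # asymmetry falls out of the same fit
    return LogisticRegression().fit(X, y)
```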
The one thing that is NOT yet working is a nonzero minimum ITI. The models I've successfully fit use iti_min=0, iti_p=0.5. As a result, the model only shows one fixed point, but I'm hoping that once I get a longer minimum ITI working we'll see two fixed points.
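For reference, here's my reading of the iti_min/iti_p parameterization (this is an assumption about the sampling, not necessarily what the environment code does): the ITI is iti_min plus a geometric number of extra steps governed by iti_p.

```python
import numpy as np

def sample_iti(iti_min=0, iti_p=0.5, rng=None):
    """Sample an ITI length (assumed parameterization, for illustration)."""
    rng = rng or np.random.default_rng()
    # Generator.geometric(p) returns a draw in {1, 2, ...}; subtract 1 so an
    # ITI of exactly iti_min is possible
    return iti_min + rng.geometric(iti_p) - 1
```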
As a first pass, I'm actually getting pretty good behavior. Basically, each trial has an ITI where the reward is always zero (regardless of the action), and then on the next step we calculate the reward based on the action. So each trial's length is ITI+1. Also, the input that was previously always zero now signals when it is NOT the ITI: at the time step when the agent can act, the input is 1, effectively cueing the agent to respond. (I suspect that, if we were to add a penalty for a non-null action during the ITI, this could work just fine. But then we would have three actions instead of two.)
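A bare-bones sketch of that trial structure (two actions; the names and interface are just illustrative, not the actual environment code):

```python
def run_trial(policy, reward_fn, iti, aprev=0, rprev=0.0):
    """One trial = `iti` steps of ITI followed by a single decision step.
    policy(obs) -> action (0 or 1); reward_fn(action) -> float."""
    for t in range(iti + 1):
        go_cue = 1 if t == iti else 0   # input is 1 only on the step where the agent should act
        obs = (rprev, aprev, go_cue)    # previous reward, previous action, cue
        action = policy(obs)
        # reward is always zero during the ITI, regardless of the action;
        # only the final step's action is scored
        r = reward_fn(action) if go_cue else 0.0
        aprev, rprev = action, r        # naive per-timestep carryover (see below)
    return aprev, rprev                 # carried into the next trial
```

For example, `run_trial(lambda obs: 0, lambda a: float(a == 0), iti=3)` would run one four-step trial with a policy that always picks action 0.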
The thing I haven't thought enough about yet is how to encode aprev and rprev. Currently I am doing this naively: at each time step, the inputs are just the action and reward from the immediately preceding time step. But this means that, by the time the ITI is over, the aprev/rprev from the previous trial (which is what really matters) is totally gone. That may actually be totally fine, because that info should update the network's fixed point. I guess what I'm saying is, now perhaps we expect a line attractor for belief.
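For concreteness, the naive encoding versus the sparse alternative described above (aprev/rprev nonzero only on the step directly following the decision) would look something like this; the function names and interface are just illustrative:

```python
def prev_inputs_naive(last_step_action, last_step_reward):
    """Naive encoding: every time step just sees the action/reward from the
    immediately preceding time step, so the previous trial's outcome has
    washed out of the inputs by the end of the ITI."""
    return last_step_action, last_step_reward

def prev_inputs_sparse(just_decided, last_action, last_reward):
    """Sparse encoding: aprev/rprev are nonzero only on the time step
    directly following a decision, and zero on every other step."""
    if just_decided:
        return last_action, last_reward
    return 0, 0.0
```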