mobeets / sarsa-rnn


policy gradient solving PerceptualDecisionMaking #9

Closed mobeets closed 2 years ago

mobeets commented 2 years ago

Okay, so I think the simplest task is PerceptualDecisionMaking, though I do feel it would be even simpler if there were just one observation (rather than two), akin to a left vs. right motion-direction discrimination. In any case, compared to the default code, I made the following changes:

My network is a GRU with 10 hidden units, trained with $\gamma = 0.9$, using Adam with lr = 0.002. The GRU's initial hidden state is all zeros, and this is reset at the beginning of each trial. I took gradient steps every 5 episodes (i.e., batch_size = 5), and normalized the discounted rewards across all timesteps in the batch.
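For reference, here's a minimal sketch of that setup (not my exact code; the environment is assumed to follow the old gym-style reset/step API, and obs_dim / act_dim are placeholders for the task's observation and action sizes):

```python
import torch
import torch.nn as nn

# Minimal sketch of the setup described above (not the exact code).
# Assumes `env` follows the old gym API: reset() -> obs, step(a) -> (obs, r, done, info).

class PolicyGRU(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_size=10):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, act_dim)

    def forward(self, obs, h):
        # obs: (1, 1, obs_dim); h: (1, 1, hidden_size)
        out, h = self.gru(obs, h)
        dist = torch.distributions.Categorical(logits=self.readout(out[:, -1]))
        return dist, h

def train(env, policy, n_episodes=3000, batch_size=5, gamma=0.9, lr=0.002):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    batch_logps, batch_returns = [], []
    for ep in range(n_episodes):
        obs = env.reset()
        h = torch.zeros(1, 1, policy.gru.hidden_size)  # hidden state reset every trial
        logps, rewards, done = [], [], False
        while not done:
            x = torch.as_tensor(obs, dtype=torch.float32).view(1, 1, -1)
            dist, h = policy(x, h)
            a = dist.sample()
            logps.append(dist.log_prob(a))
            obs, r, done, _ = env.step(a.item())
            rewards.append(float(r))
        G, returns = 0.0, []
        for r in reversed(rewards):          # discounted returns, gamma = 0.9
            G = r + gamma * G
            returns.insert(0, G)
        batch_logps += logps
        batch_returns += returns
        if (ep + 1) % batch_size == 0:       # gradient step every 5 episodes
            R = torch.tensor(batch_returns)
            R = (R - R.mean()) / (R.std() + 1e-8)  # normalize across all timesteps in batch
            loss = -(torch.cat(batch_logps) * R).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
            batch_logps, batch_returns = [], []
```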

Training takes about 3000 episodes:

And here's how the network's predictions evolve over time:

These responses are for a noiseless signal. t=0 is fixation, t=T is decision time, and all other timesteps are the stimulus. The output shown is the probability of the correct action: a negative coherence means the correct answer is a == 0, so the plotted quantity is P(a == 0).

So a few notes:

mobeets commented 2 years ago

Also, if you initialize the hidden state to have a one in a single dimension, it can more easily learn to not act at all during fixation.
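Something like this, for concreteness (just a sketch):

```python
import torch

# Sketch: instead of an all-zeros initial hidden state, put a 1 in one dimension.
hidden_size = 10
h0 = torch.zeros(1, 1, hidden_size)
h0[0, 0, 0] = 1.0  # which dimension gets the 1 is arbitrary
```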

mobeets commented 2 years ago

Okay, now doing a custom variant of the task where there's only one stimulus dimension, and the task is to decide whether the mean is negative or positive.

Below are the network's responses to noise-free trials, after 5000 trials of training. I'm also now normalizing the action probabilities after removing the probability of responding null; see the sketch below.
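By "normalizing after removing the prob of responding null" I mean something like the following, assuming action 0 is the null/fixation action:

```python
import torch

# Sketch: drop the null/fixation probability and renormalize over the two choices.
logits = torch.tensor([0.2, 1.5, -0.3])     # example network output at one timestep
probs = torch.softmax(logits, dim=-1)       # P(null), P(choice 0), P(choice 1)
choice_probs = probs[1:] / probs[1:].sum()  # renormalized choice probabilities
```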

mobeets commented 2 years ago

Let RNN choose when to respond

Next goal: set early_response=True, and try to get the network to only respond once it is confident. To encourage this, we may need to increase the failure penalty.

With {'fail': -1}, the network still responds on the very first timestep. Here's what its action probabilities look like if I force it to wait:


Now I'm trying sigma=2.0 and {'fail': -5}. Here are the two unnormalized response probabilities, where coh < 0 now means a leftward stimulus, and coh > 0 means a rightward stimulus.

For rightwards stimuli, it looks like the network is close to integrating, but not so for leftwards stimuli. Ideally, the more samples you have seen, the more likely you should be to respond.
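For reference, the environment settings above are roughly this (a sketch; I'm assuming the standard neurogym constructor here, and the keyword names for my custom variant may differ):

```python
import neurogym as ngym

# Sketch of the environment settings above (keyword names may differ for my custom variant).
env = ngym.make(
    'PerceptualDecisionMaking-v0',
    rewards={'fail': -5},  # heavier penalty for incorrect choices
    sigma=2.0,             # stimulus noise level
)
```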

mobeets commented 2 years ago

Okay, so suppose $X_t \sim N(\mu, \sigma^2)$, and that we want to estimate $\mu$ with the running mean:

$$ \hat{\mu}_t = \frac{1}{t} \sum_{i=1}^{t} X_i $$

Then $\hat{\mu}_t \sim N(\mu, \sigma^2/t)$, i.e., the standard deviation of the estimate shrinks as $\sigma/\sqrt{t}$.
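A quick numerical check of that scaling:

```python
import numpy as np

# Check that the std of the running mean shrinks like sigma / sqrt(t).
rng = np.random.default_rng(0)
mu, sigma, n_trials, T = 0.5, 2.0, 10000, 20
X = rng.normal(mu, sigma, size=(n_trials, T))
mu_hat = X.cumsum(axis=1) / np.arange(1, T + 1)   # running mean at each timestep
for t in (1, 5, 20):
    print(t, mu_hat[:, t - 1].std(), sigma / np.sqrt(t))
```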

mobeets commented 2 years ago

Okay, now training only on cohs=[12.8] but testing on all coherences; this seems to work much better:

And here's the hidden activity:

mobeets commented 2 years ago

Trained with sigma = 3. Now plotting average accuracy and RT (± SE, 25 repeats):
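(The error bars are just the standard error over repeats, e.g.:)

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch of the accuracy panel: mean ± SE over repeats (placeholder data).
cohs = np.array([0, 6.4, 12.8, 25.6, 51.2])       # example coherence levels
acc = np.random.rand(25, len(cohs))               # placeholder: 25 repeats x coherences
mean = acc.mean(axis=0)
se = acc.std(axis=0, ddof=1) / np.sqrt(acc.shape[0])
plt.errorbar(cohs, mean, yerr=se, marker='o')
plt.xlabel('coherence (%)')
plt.ylabel('accuracy')
plt.show()
```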

mobeets commented 2 years ago

Okay, I'm closing this. Summary of what I have achieved so far:

mobeets commented 2 years ago

One note: I realized after the fact that there were accidentally two observation dimensions (same signal, different noise), so basically two independent samples per timestep. Taking this away seems to make it difficult for the network to learn, which is a little annoying. (Though this is likely just because the effective noise level is now larger: averaging two independent samples cuts the noise standard deviation by a factor of $\sqrt{2}$.)