mobeets / sarsa-rnn


policy gradient solving PerceptualDecisionMaking #9

Closed mobeets closed 2 years ago

mobeets commented 2 years ago

Okay, so I think the simplest task is PerceptualDecisionMaking, though I do feel it would be even simpler if there were just one observation (rather than two), akin to a left vs. right motion-direction discrimination. In any case, compared to the default code, I made the following changes:

My network is a GRU with 10 hidden units, trained with $\gamma = 0.9$, using Adam with lr = 0.002. The GRU's initial hidden state is all zeros, and this is reset at the beginning of each trial. I took gradient steps every 5 episodes (i.e., batch_size = 5), and normalized the discounted rewards across all timesteps in the batch.
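For reference, here's a minimal sketch of that setup (not my exact code; the environment is assumed to follow the old gym-style reset/step API, and obs_dim / act_dim are placeholders for the task's observation and action sizes):

```python
import torch
import torch.nn as nn

# Minimal sketch of the setup described above (not the exact code).
# Assumes `env` follows the old gym API: reset() -> obs, step(a) -> (obs, r, done, info).

class PolicyGRU(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_size=10):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, act_dim)

    def forward(self, obs, h):
        # obs: (1, 1, obs_dim); h: (1, 1, hidden_size)
        out, h = self.gru(obs, h)
        dist = torch.distributions.Categorical(logits=self.readout(out[:, -1]))
        return dist, h

def train(env, policy, n_episodes=3000, batch_size=5, gamma=0.9, lr=0.002):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    batch_logps, batch_returns = [], []
    for ep in range(n_episodes):
        obs = env.reset()
        h = torch.zeros(1, 1, policy.gru.hidden_size)  # hidden state reset every trial
        logps, rewards, done = [], [], False
        while not done:
            x = torch.as_tensor(obs, dtype=torch.float32).view(1, 1, -1)
            dist, h = policy(x, h)
            a = dist.sample()
            logps.append(dist.log_prob(a))
            obs, r, done, _ = env.step(a.item())
            rewards.append(float(r))
        G, returns = 0.0, []
        for r in reversed(rewards):          # discounted returns, gamma = 0.9
            G = r + gamma * G
            returns.insert(0, G)
        batch_logps += logps
        batch_returns += returns
        if (ep + 1) % batch_size == 0:       # gradient step every 5 episodes
            R = torch.tensor(batch_returns)
            R = (R - R.mean()) / (R.std() + 1e-8)  # normalize across all timesteps in batch
            loss = -(torch.cat(batch_logps) * R).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
            batch_logps, batch_returns = [], []
```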

Training takes about 3000 episodes:

And here's how the network's predictions evolve over time:

These responses are for a noiseless signal. t=0 is fixation, t=T is decision time, and all other timesteps are the stimulus. The output shown is the probability of the correct action: a negative coherence means the correct answer is a == 0, so the plotted quantity is P(a == 0).

So a few notes:

mobeets commented 2 years ago

Also, if you initialize the hidden state to have a one in a single dimension, it can more easily learn to not act at all during fixation.
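Something like this, for concreteness (just a sketch):

```python
import torch

# Sketch: instead of an all-zeros initial hidden state, put a 1 in one dimension.
hidden_size = 10
h0 = torch.zeros(1, 1, hidden_size)
h0[0, 0, 0] = 1.0  # which dimension gets the 1 is arbitrary
```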

mobeets commented 2 years ago

Okay, now doing a custom variant of the task where there's only one stimulus dimension, and the task is to decide whether the mean is negative or positive.

Below are the network's responses to noise-free trials, after 5000 trials of training. I'm also now normalizing the action probabilities after removing the probability of responding null; see the sketch below.
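By "normalizing after removing the prob of responding null" I mean something like the following, assuming action 0 is the null/fixation action:

```python
import torch

# Sketch: drop the null/fixation probability and renormalize over the two choices.
logits = torch.tensor([0.2, 1.5, -0.3])     # example network output at one timestep
probs = torch.softmax(logits, dim=-1)       # P(null), P(choice 0), P(choice 1)
choice_probs = probs[1:] / probs[1:].sum()  # renormalized choice probabilities
```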

mobeets commented 2 years ago

Let RNN choose when to respond

Next goal: set early_response=True, and try to get the network to only respond once it is confident. To encourage this, we may need to increase the failure penalty.

With {'fail': -1}, the network still responds on the very first timestep. Here's what its action probabilities look like if I force it to wait:


Now I'm trying sigma=2.0 and {'fail': -5}. Here are the two unnormalized response probabilities, where coh < 0 now means a leftward stimulus, and coh > 0 means a rightward stimulus.

For rightwards stimuli, it looks like the network is close to integrating, but not so for leftwards stimuli. Ideally, the more samples you have seen, the more likely you should be to respond.
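For reference, the environment settings above are roughly this (a sketch; I'm assuming the standard neurogym constructor here, and the keyword names for my custom variant may differ):

```python
import neurogym as ngym

# Sketch of the environment settings above (keyword names may differ for my custom variant).
env = ngym.make(
    'PerceptualDecisionMaking-v0',
    rewards={'fail': -5},  # heavier penalty for incorrect choices
    sigma=2.0,             # stimulus noise level
)
```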

mobeets commented 2 years ago

Okay, so suppose $X_t \sim N(\mu, \sigma^2)$, and that we want to estimate $\mu$ with the running mean:

$$ \hat{\mu}_t = \frac{1}{t} \sum_{i=1}^{t} X_i $$

Then $\hat{\mu}_t \sim N(\mu, \sigma^2/t)$, i.e., the standard deviation of the estimate shrinks as $\sigma/\sqrt{t}$.
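A quick numerical check of that scaling:

```python
import numpy as np

# Check that the std of the running mean shrinks like sigma / sqrt(t).
rng = np.random.default_rng(0)
mu, sigma, n_trials, T = 0.5, 2.0, 10000, 20
X = rng.normal(mu, sigma, size=(n_trials, T))
mu_hat = X.cumsum(axis=1) / np.arange(1, T + 1)   # running mean at each timestep
for t in (1, 5, 20):
    print(t, mu_hat[:, t - 1].std(), sigma / np.sqrt(t))
```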

mobeets commented 2 years ago

Okay, now training only on cohs=[12.8] but testing on all coherences; this seems to work much better:

And here's the hidden activity:

mobeets commented 2 years ago

Trained with sigma = 3. Now plotting average accuracy and RT (± SE, 25 repeats):
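(The error bars are just the standard error over repeats, e.g.:)

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch of the accuracy panel: mean ± SE over repeats (placeholder data).
cohs = np.array([0, 6.4, 12.8, 25.6, 51.2])       # example coherence levels
acc = np.random.rand(25, len(cohs))               # placeholder: 25 repeats x coherences
mean = acc.mean(axis=0)
se = acc.std(axis=0, ddof=1) / np.sqrt(acc.shape[0])
plt.errorbar(cohs, mean, yerr=se, marker='o')
plt.xlabel('coherence (%)')
plt.ylabel('accuracy')
plt.show()
```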

mobeets commented 2 years ago

Okay, I'm closing this. Summary of what I have achieved so far:

mobeets commented 2 years ago

One note: I realized after the fact that there were accidentally two observation dimensions (same signal, different noise), so basically two independent samples per timestep. Taking this away seems to make it difficult for the network to learn, which is a little annoying. (Though this is likely just because the effective noise level is now larger: averaging two independent samples cuts the noise standard deviation by a factor of $\sqrt{2}$.)