Actually, we get MUCH more dramatic differences between trained and random RQNs (recurrent Q networks) when using a random behavioral policy (i.e., randomly choosing an action on each trial).
Previously, I was using the trained RQN as the behavioral policy and just tracking the untrained RQN's activity as the hidden state. But a random policy may explore the true sample space more evenly (at least in this task), letting us more fairly assess the RQN's hidden representations.
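For reference, here is a minimal sketch of that procedure, assuming PyTorch and a two-armed bandit where the high arm pays off with p=0.8 (arm 1 as the high arm is an arbitrary choice here). The function name `rollout_random_policy` and all parameter defaults are hypothetical, not code from this repo; the point is just "run a random behavioral policy and record the network's hidden states for later analysis."

```python
import numpy as np
import torch
import torch.nn as nn

def rollout_random_policy(gru_cell, n_trials=500, p_high=0.8, hidden_size=3, seed=0):
    """Run a two-armed bandit under a uniformly random policy,
    feeding o_t = [r_{t-1}, a_{t-1}] to the GRU and recording z_t on each trial."""
    rng = np.random.default_rng(seed)
    z = torch.zeros(1, hidden_size)          # initial hidden state z_0
    r_prev, a_prev = 0.0, 0.0                # no reward/action before trial 1
    zs, actions, rewards = [], [], []
    for _ in range(n_trials):
        o = torch.tensor([[r_prev, a_prev]], dtype=torch.float32)  # o_t
        with torch.no_grad():
            z = gru_cell(o, z)               # z_t = GRU(z_{t-1}, o_t)
        a = int(rng.integers(2))             # random policy: ignore Q, pick 0 or 1
        r = float(rng.random() < (p_high if a == 1 else 1 - p_high))
        zs.append(z.squeeze(0).numpy()); actions.append(a); rewards.append(r)
        r_prev, a_prev = r, float(a)
    return np.stack(zs), np.array(actions), np.array(rewards)

# e.g., compare hidden trajectories of a trained vs. a freshly initialized GRU:
# zs_untrained, _, _ = rollout_random_policy(nn.GRUCell(2, 3))
```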
Model details: I trained a 3-unit GRU with a linear readout to Q-values (i.e., $Q(a=1)$ and $Q(a=0)$) on Celia's task, where the high-reward arm has $p=0.8$. The observations are $o_t = [r_{t-1}, a_{t-1}]$, model activity is $z_t = \mathrm{GRU}(z_{t-1}, o_t)$, and model output is $Q_t = z_t^\top w$, where $w$ is a learned linear readout. Here $z$ is 3D and $Q$ is 2D. I train the model using Q-learning. The model's action policy is ε-greedy; ε is annealed during training and ε=0 during testing (so ε=0 in all figures below).
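A minimal sketch of that setup, again assuming PyTorch: the class/function names (`RQN`, `epsilon_greedy`, `q_learning_loss`) are hypothetical, and the discount factor is illustrative rather than the value actually used here.

```python
import torch
import torch.nn as nn

class RQN(nn.Module):
    """3-unit GRU with a linear readout to 2 Q-values: Q_t = z_t^T w."""
    def __init__(self, obs_size=2, hidden_size=3, n_actions=2):
        super().__init__()
        self.gru = nn.GRUCell(obs_size, hidden_size)
        self.readout = nn.Linear(hidden_size, n_actions, bias=False)

    def forward(self, o_t, z_prev):
        z_t = self.gru(o_t, z_prev)           # z_t = GRU(z_{t-1}, o_t)
        q_t = self.readout(z_t)               # Q_t = z_t^T w (one value per action)
        return q_t, z_t

def epsilon_greedy(q_t, epsilon):
    """Pick argmax_a Q(a) with prob 1 - epsilon, otherwise a random action."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(q_t.shape[-1], (1,)).item()
    return int(q_t.argmax(dim=-1).item())

def q_learning_loss(q_t, a_t, r_t, q_next, gamma=0.9):
    """Squared error against the one-step Q-learning target r_t + gamma * max_a' Q_{t+1}(a')."""
    target = r_t + gamma * q_next.detach().max(dim=-1).values
    return (q_t[:, a_t] - target).pow(2).mean()
```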
Results: Here's Fig 1B-D from Celia's paper, using the trained RQN:
Here are some tentative observations so far: