mobeets / q-rnn


Beron2022 working example #5

Closed mobeets closed 1 year ago

mobeets commented 1 year ago

Model details: I trained a 3-unit GRU, with a linear readout to Q-values (i.e., Q(a=1) and Q(a=0)), on Celia's task where the high-reward arm has p=0.8. The observations are $o_t = [r_{t-1}, a_{t-1}]$, model activity is $z_t = \mathrm{GRU}(z_{t-1}, o_t)$, and model output is $Q_t = z_t^\top w$, where $w$ is a learned linear readout. Here $z$ is 3D, and $Q$ is 2D. I train the model using Q-learning. The model's action policy is ε-greedy; ε is annealed during training, and ε=0 during testing (so ε=0 in all figures below).
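For reference, here is a minimal sketch of how a model like this could be set up in PyTorch. This is not the repo's actual code; the class and parameter names (`RQN`, `hidden_dim`, etc.) are illustrative assumptions based on the description above.

```python
import torch
import torch.nn as nn

class RQN(nn.Module):
    """Recurrent Q-network sketch: GRU over observations, linear readout to Q-values.

    Hypothetical example; names and defaults are illustrative, not the repo's code.
    """
    def __init__(self, obs_dim=2, hidden_dim=3, n_actions=2):
        super().__init__()
        self.gru = nn.GRUCell(obs_dim, hidden_dim)                    # z_t = GRU(z_{t-1}, o_t)
        self.readout = nn.Linear(hidden_dim, n_actions, bias=False)   # Q_t = z_t^T w

    def forward(self, obs, z_prev):
        z = self.gru(obs, z_prev)   # (batch, hidden_dim)
        q = self.readout(z)         # (batch, n_actions)
        return q, z

def epsilon_greedy(q_values, eps):
    """Choose argmax-Q with probability 1 - eps, otherwise a uniformly random action."""
    if torch.rand(1).item() < eps:
        return torch.randint(q_values.shape[-1], (1,)).item()
    return q_values.argmax(dim=-1).item()

# One-step Q-learning update (per trial t), for orientation:
#   target_t = r_t + gamma * max_a Q(z_{t+1}, a)
#   loss_t   = (Q(z_t, a_t) - target_t)^2
```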

Results: Here's Fig 1B-D from Celia's paper, using the trained RQN (recurrent Q network):

[figures: reproduced panels of Fig 1B–D]

Here are some tentative observations so far:

mobeets commented 1 year ago

Actually, we get MUCH more dramatic differences between trained vs. random RQNs when using a random policy (i.e., randomly choosing an action on each trial).

Before, I was using the trained RQN as the behavioral policy and just tracking the untrained RQN's activity as the hidden state. But the random policy may explore the true sample space more evenly (at least in this task), letting us assess the RQNs' hidden representations more fairly.
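For concreteness, one way this random-policy probe could look, reusing the `RQN` sketch above: run a uniformly random behavioral policy and feed the same observation stream through both the trained and the untrained network, recording each one's hidden state. The environment interface (a two-armed bandit whose `step(action)` returns the reward) and all names here are assumptions, not the actual analysis code.

```python
import torch

@torch.no_grad()
def collect_hidden_states(env, rqns, n_trials=1000, hidden_dim=3):
    """Run a random policy in `env` while recording each RQN's hidden state.

    `rqns` is a dict like {'trained': model_a, 'untrained': model_b}.
    Hypothetical sketch: `env.step(action)` is assumed to return the reward
    for the chosen arm on that trial.
    """
    z = {name: torch.zeros(1, hidden_dim) for name in rqns}
    traces = {name: [] for name in rqns}
    prev_action, prev_reward = 0, 0.0
    for _ in range(n_trials):
        obs = torch.tensor([[prev_reward, float(prev_action)]])  # o_t = [r_{t-1}, a_{t-1}]
        for name, model in rqns.items():
            _, z[name] = model(obs, z[name])          # update each RQN's hidden state
            traces[name].append(z[name].squeeze(0).clone())
        action = int(torch.randint(2, (1,)))          # random behavioral policy
        prev_reward = env.step(action)
        prev_action = action
    return {name: torch.stack(ts) for name, ts in traces.items()}
```

The returned `(n_trials, hidden_dim)` arrays can then be compared across the two networks (e.g., by regressing them against task variables) to assess how evenly the random policy samples the state space.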