Actually, we get MUCH more dramatic differences between trained and random RQNs (recurrent Q networks) when using a random behavioral policy (i.e., randomly choosing an action on each trial).
Previously, I was using the trained RQN as the behavioral policy and just tracking the untrained RQN's activity as the hidden state. But a random policy may explore the true sample space more evenly (at least in this task), letting us more fairly assess the RQN's hidden representations.
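For reference, here is a minimal sketch of that procedure, assuming PyTorch and a two-armed bandit where the high arm pays off with p=0.8 (arm 1 as the high arm is an arbitrary choice here). The function name `rollout_random_policy` and all parameter defaults are hypothetical, not code from this repo; the point is just "run a random behavioral policy and record the network's hidden states for later analysis."

```python
import numpy as np
import torch
import torch.nn as nn

def rollout_random_policy(gru_cell, n_trials=500, p_high=0.8, hidden_size=3, seed=0):
    """Run a two-armed bandit under a uniformly random policy,
    feeding o_t = [r_{t-1}, a_{t-1}] to the GRU and recording z_t on each trial."""
    rng = np.random.default_rng(seed)
    z = torch.zeros(1, hidden_size)          # initial hidden state z_0
    r_prev, a_prev = 0.0, 0.0                # no reward/action before trial 1
    zs, actions, rewards = [], [], []
    for _ in range(n_trials):
        o = torch.tensor([[r_prev, a_prev]], dtype=torch.float32)  # o_t
        with torch.no_grad():
            z = gru_cell(o, z)               # z_t = GRU(z_{t-1}, o_t)
        a = int(rng.integers(2))             # random policy: ignore Q, pick 0 or 1
        r = float(rng.random() < (p_high if a == 1 else 1 - p_high))
        zs.append(z.squeeze(0).numpy()); actions.append(a); rewards.append(r)
        r_prev, a_prev = r, float(a)
    return np.stack(zs), np.array(actions), np.array(rewards)

# e.g., compare hidden trajectories of a trained vs. a freshly initialized GRU:
# zs_untrained, _, _ = rollout_random_policy(nn.GRUCell(2, 3))
```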
Model details: I trained a 3-unit GRU with a linear readout to Q-values (i.e., $Q(a=1)$ and $Q(a=0)$) on Celia's task, where the high-reward arm has $p=0.8$. The observations are $o_t = [r_{t-1}, a_{t-1}]$, model activity is $z_t = \mathrm{GRU}(z_{t-1}, o_t)$, and model output is $Q_t = z_t^\top w$, where $w$ is a learned linear readout. Here $z$ is 3D and $Q$ is 2D. I train the model using Q-learning. The model's action policy is ε-greedy; ε is annealed during training and ε=0 during testing (so ε=0 in all figures below).
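A minimal sketch of that setup, again assuming PyTorch: the class/function names (`RQN`, `epsilon_greedy`, `q_learning_loss`) are hypothetical, and the discount factor is illustrative rather than the value actually used here.

```python
import torch
import torch.nn as nn

class RQN(nn.Module):
    """3-unit GRU with a linear readout to 2 Q-values: Q_t = z_t^T w."""
    def __init__(self, obs_size=2, hidden_size=3, n_actions=2):
        super().__init__()
        self.gru = nn.GRUCell(obs_size, hidden_size)
        self.readout = nn.Linear(hidden_size, n_actions, bias=False)

    def forward(self, o_t, z_prev):
        z_t = self.gru(o_t, z_prev)           # z_t = GRU(z_{t-1}, o_t)
        q_t = self.readout(z_t)               # Q_t = z_t^T w (one value per action)
        return q_t, z_t

def epsilon_greedy(q_t, epsilon):
    """Pick argmax_a Q(a) with prob 1 - epsilon, otherwise a random action."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(q_t.shape[-1], (1,)).item()
    return int(q_t.argmax(dim=-1).item())

def q_learning_loss(q_t, a_t, r_t, q_next, gamma=0.9):
    """Squared error against the one-step Q-learning target r_t + gamma * max_a' Q_{t+1}(a')."""
    target = r_t + gamma * q_next.detach().max(dim=-1).values
    return (q_t[:, a_t] - target).pow(2).mean()
```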
Results: Here's Fig 1B-D from Celia's paper, using the trained RQN:
Here are some tentative observations so far: