mobeets / q-rnn


Beron2022 fixed points #6

Closed · mobeets closed this 1 year ago

mobeets commented 1 year ago

Regardless of p_rew_max and p_switch:

Below: p_rew_max=0.8, p_switch=0.02:

Now comparing the fixed points for all the different values of p_rew_max and p_switch:
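
(For context, here's a minimal sketch of the task, as I understand it from Beron et al. 2022: a two-armed bandit where the high-value arm pays reward with probability p_rew_max, the other with 1 - p_rew_max, and the arms' identities switch with probability p_switch on each trial. The class and names below are mine, for illustration only.)

```python
import numpy as np

class BeronBandit:
    """Illustrative sketch of the Beron et al. 2022 reversal task:
    the high-value arm pays reward with prob p_rew_max (the other
    with 1 - p_rew_max), and the identity of the high-value arm
    switches with prob p_switch at the start of each trial."""
    def __init__(self, p_rew_max=0.8, p_switch=0.02, seed=0):
        self.p_rew_max = p_rew_max
        self.p_switch = p_switch
        self.rng = np.random.default_rng(seed)
        self.high_arm = int(self.rng.integers(2))

    def step(self, action):
        # maybe reverse which arm is the high-value one
        if self.rng.random() < self.p_switch:
            self.high_arm = 1 - self.high_arm
        p_r = self.p_rew_max if action == self.high_arm else 1 - self.p_rew_max
        return int(self.rng.random() < p_r)  # reward: 0 or 1
```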

mobeets commented 1 year ago

todo: does the RNN learn the symmetry of a_prev == r_prev?

Below, the "+" marks are the empirical fixed points obtained after 100 time steps of applying the same observation to the RQN (from 100 random initializations). The dots are the RQN's activity during an actual experiment, with colors indicating the input:

Basically, we would expect the blue and orange fixed points to be the same. But they are not. This kinda makes sense, since "r=0" is just encoded as the absence of an input, so it would perhaps be hard for the network to learn the correct symmetry.
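
For reference, here's a minimal sketch of the empirical fixed-point procedure described above, assuming the RQN's recurrent core behaves like a torch.nn.GRUCell (the function and the observation encoding are placeholders, not necessarily what this repo uses):

```python
import torch

def empirical_fixed_points(rnn, obs, n_inits=100, n_steps=100):
    """Clamp the input to a single observation and iterate the RNN from
    many random hidden states; rows of the result that have converged
    are (near-)fixed points of the dynamics under that input.
    `rnn`: a torch.nn.GRUCell-like module; `obs`: a 1D input tensor,
    e.g. an encoding of (a_prev, r_prev)."""
    h = torch.randn(n_inits, rnn.hidden_size)  # random initializations
    x = obs.unsqueeze(0).repeat(n_inits, 1)    # same observation for all
    with torch.no_grad():
        for _ in range(n_steps):
            h = rnn(x, h)                      # apply the same input repeatedly
    return h
```

Calling this once per input (the four (a_prev, r_prev) combinations) and deduplicating near-identical rows gives the "+" points.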

mobeets commented 1 year ago

So we see that the RNN tends to have four fixed points, rather than two.

I've seen this in two models now, both trained with p=0.8 and H=3. The model on the left was trained with an ε-greedy policy, the one on the right with a softmax policy; during training the two achieved similar performance. (Below, the light dots come from running the models with noisy action selection. Note that this doesn't change the underlying dynamics, so the correspondence with beliefs should hold regardless of how noisily we sample actions from the models.)
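
For concreteness, here's a minimal sketch of the two action-selection rules (the eps and temperature values are illustrative, not necessarily what was used in training):

```python
import numpy as np

def epsilon_greedy(q, eps=0.1, rng=None):
    """With probability eps take a uniformly random action; else the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def softmax_policy(q, temp=1.0, rng=None):
    """Sample actions with probabilities proportional to exp(Q / temp)."""
    rng = rng or np.random.default_rng()
    z = (np.asarray(q) - np.max(q)) / temp  # shift by max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q), p=p))
```

Both read from the same Q-values, so swapping policies changes only how noisily actions are sampled, not the recurrent dynamics.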

In both models, we see four fixed points. Based on the above logic we'd expect the blue and orange FPs to be the same, and ditto for green and red, since each pair of observations implies the same optimal action (a=1 for blue/orange, a=0 for green/red). Note, though, that in both cases there is a 1D readout that would give us the FPs we expect. In other words, we could find a readout of the RNN's activity that obeys this symmetry.

So assuming the latent space is 2D, there are only two things we can linearly read out from this network: the optimal action (i.e., the belief) and the reward. What we cannot read out linearly is the chosen action, because the fixed points sharing a chosen action sit at diagonals.

Okay, so we have four inputs, so it makes sense that we have four fixed points. But what the model effectively chooses is how those four points are arranged in space: specifically, which fixed points sit diagonally opposite one another. And the networks above seem to consistently place inputs with the same chosen action at diagonals. (Though this all assumes the latent space is 2D, which is not always the case.)

mobeets commented 1 year ago

An update to this type of thinking: I think that, given that Q is a linear readout of the activity, we MUST have that blue/orange are linearly separable from green/red. In other words, in the square layout, blue and orange can never be diagonal.

There are three equivalence classes of ways to lay out the 4 fixed points on a square. Two of them are valid; the one where blue/orange are on a diagonal is not. The reason: when blue/orange sit on a diagonal of the square, there is no way to draw the ΔQ readout such that blue/orange land on one side and red/green on the other.

This matters because blue (R=0, A=0) and orange (R=1, A=1) need to map to the same next action (A=1), and likewise for red/green.
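
To make that concrete, here's a toy check (the coordinates and the brute-force search are purely illustrative): place the four fixed points at the corners of a square and ask whether any linear readout puts blue/orange strictly on one side and green/red on the other. The blue/orange-diagonal layout is exactly the XOR configuration, which no linear readout can separate.

```python
import numpy as np

# Four fixed points at the corners of a square (coordinates illustrative).
# "Adjacent" layout: blue/orange share a side; "diagonal": blue/orange oppose.
adjacent = {'blue': (1, 1), 'orange': (1, -1), 'green': (-1, 1), 'red': (-1, -1)}
diagonal = {'blue': (1, 1), 'orange': (-1, -1), 'green': (-1, 1), 'red': (1, -1)}

def linearly_separable(points, pos=('blue', 'orange'), n_tries=10000, seed=0):
    """Brute-force search for a linear readout (w, b) with w @ x + b > 0
    exactly for the `pos` points: crude, but fine for a 2D toy check."""
    rng = np.random.default_rng(seed)
    X = np.array(list(points.values()), dtype=float)
    y = np.array([1 if k in pos else -1 for k in points])
    for _ in range(n_tries):
        w, b = rng.normal(size=2), rng.normal()
        if np.all(np.sign(X @ w + b) == y):
            return True
    return False

print(linearly_separable(adjacent))  # True: a valid ΔQ readout exists
print(linearly_separable(diagonal))  # False: the XOR layout has none
```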