does the Q function relate to beliefs?

The figure above compares the optimal beliefs (orange, blue, green) with the trained agent's Q function (brown, red, purple) on an example trial.

One thought I had was: wait, isn't the optimal policy in some sense the beliefs themselves? I realized this depends on how you select actions. For example, say ε=0. Then the network's action is argmax Q, so in a sense the Q values of any non-max action are irrelevant. If we instead had softmax action selection, you would actually want to keep Q(s,a=L)=0 and Q(s,a=R)=0 until it was time to make a decision. In either case, I don't think the actions should reflect the beliefs without further assumptions.
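To make the argmax vs. softmax distinction concrete, here is a minimal sketch in plain Python. It is not taken from the q-rnn code: the three-action layout (`wait`, `L`, `R`), the temperature of 1.0, and the example Q values are all assumptions chosen for illustration.

```python
import math

# Hypothetical action set: "wait" is an assumed third action; L and R are
# the two choices discussed above.
ACTIONS = ["wait", "L", "R"]


def greedy_action(q):
    """Index of the max-Q action (what epsilon-greedy does when epsilon = 0)."""
    return max(range(len(q)), key=lambda i: q[i])


def softmax_policy(q, temperature=1.0):
    """Action probabilities under softmax selection (temperature is assumed)."""
    m = max(q)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in q]
    total = sum(exps)
    return [e / total for e in exps]


# Two made-up Q vectors that share the same max action ("wait") but differ
# in the values assigned to the non-max actions.
q_flat = [1.0, 0.0, 0.0]
q_graded = [1.0, 0.9, 0.2]

# Greedy selection ignores the non-max values: both vectors pick "wait".
same_greedy = greedy_action(q_flat) == greedy_action(q_graded)

# Softmax selection does not ignore them: raising Q(L) raises P(L).
p_flat = softmax_policy(q_flat)
p_graded = softmax_policy(q_graded)
l_more_likely = p_graded[ACTIONS.index("L")] > p_flat[ACTIONS.index("L")]
```

Under these assumptions, the greedy agent behaves identically for both Q vectors, while the softmax agent is more likely to choose L early when Q(s,a=L) is allowed to rise before decision time.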
