Open AleShi94 opened 2 years ago
DQN is a form of FQI. What is implemented here is also a form of FQI, in the sense that a target is computed and then regressed. The only difference from DQN is the use of the Bellman operator instead of the Bellman optimality operator (with the argmax), which is what we need for policy evaluation. Using DQN might work, but it wouldn't be in the mirror-descent setting we're following, so it needs more investigation. LSTD would be a good idea, although it only trains the linear part, so maybe use it just to fine-tune the linear part after the features are learned?
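The distinction between the two operators can be sketched on a toy batch; the array names and shapes below are illustrative, not taken from the repository:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99
n_transitions, n_actions = 5, 3

# Hypothetical transition batch (names are illustrative):
# q_next[i]       = Q(s'_i, .) for each next state
# next_actions[i] = the action a'_i actually taken next in the trajectory
q_next = rng.normal(size=(n_transitions, n_actions))
rewards = rng.normal(size=n_transitions)
next_actions = rng.integers(0, n_actions, size=n_transitions)

# Bellman operator (policy evaluation, SARSA-style target):
# uses the action the current policy actually took at s'.
target_eval = rewards + gamma * q_next[np.arange(n_transitions), next_actions]

# Bellman optimality operator (DQN-style target):
# uses the greedy action at s'.
target_dqn = rewards + gamma * q_next.max(axis=1)

# Both targets would then be regressed against Q(s, a) on the batch.
```

Since the max over actions dominates any particular action's value, the DQN-style target is always at least as large as the policy-evaluation target on the same batch.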
https://github.com/riccardodv/MirrorRL/blob/b7830390561630ca33fc8c4563d4ec45895a28a2/cascade_mirror_rl_fqi.py#L69-L72
It seems like this piece of code corresponds more to the SARSA method, since we use the next actions in the trajectory to compute the new target Q-values. Do we have reasons to use SARSA as a way to fit the approximation of Q?
Should we try other approaches such as Fitted Q Iteration, LSTD, or DQN? Do you know which to choose in which situation?