Closed by Arlaz 11 months ago
Hi @Arlaz, thanks for your interest! Indeed the doc is misleading; we're very grateful you reported this. We're refactoring DQN, so we should at least fix the doc.
Your fixes seem sensible to me; we'll integrate them (unless you want to make the PR).
cc @albertbou92 for context
@Arlaz Would you be happy to review #1737? I trained a couple of models and it seems ok on my side
Describe the bug
For an academic project, I wanted to compare a few versions of DQN.
Looking into the TorchRL documentation, I found the `delay_value` argument, which is said to create a target network in order to build a Double DQN. This can mislead a user into confusing a plain DQN with a target network and a real DDQN. I may not have understood all the intricacies of TorchRL, but after digging a bit into the TorchRL code, I think that using `delay_value` does not really create a Double DQN as described in the reference article.

Reason and Possible fixes
The issue may be in the `_next_value` function in advantages.py. The current implementation uses a single network (the target network, or the value network when no target is used) both to select the greedy next action and to evaluate its value. Instead of:
https://github.com/pytorch/rl/blob/2e7f574529fd4e6bd2f661b0d59bd22623e4fb49/torchrl/objectives/value/advantages.py#L427-L435
I would think of something like:
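Here is a minimal sketch of what I have in mind, written in plain PyTorch rather than against TorchRL's actual classes (`value_network`, `target_value_network`, and the tensor shapes are illustrative assumptions):

```python
import torch

def double_dqn_next_value(next_obs, reward, done, value_network,
                          target_value_network, gamma=0.99):
    # Double DQN target: the *online* value network selects the greedy action,
    # while the *target* network evaluates it.
    with torch.no_grad():
        next_action = value_network(next_obs).argmax(dim=-1, keepdim=True)
        next_q = target_value_network(next_obs).gather(-1, next_action).squeeze(-1)
        return reward + gamma * (1.0 - done.float()) * next_q
```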
and in dqn.py:
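Roughly, the loss would then consume that target; again this is only a sketch with assumed names and shapes, not the real `DQNLoss` internals:

```python
import torch.nn.functional as F

def double_dqn_loss(obs, action, reward, done, next_obs,
                    value_network, target_value_network, gamma=0.99):
    # Q(s, a) for the actions actually taken, from the online network.
    q_sa = value_network(obs).gather(-1, action.unsqueeze(-1)).squeeze(-1)
    # Double DQN target computed as in the sketch above.
    target = double_dqn_next_value(next_obs, reward, done, value_network,
                                   target_value_network, gamma)
    return F.smooth_l1_loss(q_sa, target)
```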
I made these changes (and a few others concerning the keys) and got results closer to what I can achieve with a DDQN in other libraries.
Am I wrong somewhere? Please tell me if I can help further, or even make a PR if needed. Thank you for this incredible work!