Hi, first of all: very clean implementation of these algorithms in PyTorch, much appreciated!!
After reading through the code a bit, I think there might be a small error in the code for doubledqn.
According to the paper in the link (Double DQN, page 4):
"We therefore propose to evaluate the greedy policy according to the online network, but using the target network to estimate its value. ... In comparison to Double Q-learning (4), the weights of the second network are replaced with the weights of the target network for the evaluation of the current greedy policy."
So I updated a few lines of code to make sure that the argmax over next-state Q-values is taken with the current_model (the online network), while the target_model still provides the value estimate.
I ran 50 CartPole-v0 runs for each version, and the updated version seems to converge slightly faster.
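For reference, here is a minimal sketch of the corrected target computation I had in mind. The network definitions are placeholders, and I'm assuming the repo's current_model / target_model naming; only the selection/evaluation split matters:

```python
import torch
import torch.nn as nn

# Placeholder networks standing in for the repo's current_model / target_model
# (CartPole-v0: 4-dim observation, 2 actions).
current_model = nn.Linear(4, 2)
target_model = nn.Linear(4, 2)

def double_dqn_target(reward, next_state, done, gamma=0.99):
    """Double DQN target: action *selected* by the online network,
    but *evaluated* by the target network, per the paper's Eq. (4) variant."""
    with torch.no_grad():
        # 1) choose the greedy action with the current (online) network
        next_action = current_model(next_state).argmax(dim=1, keepdim=True)
        # 2) estimate that action's value with the target network
        next_q = target_model(next_state).gather(1, next_action).squeeze(1)
    return reward + gamma * next_q * (1 - done)

# Example call on a dummy batch of 8 transitions
target = double_dqn_target(
    reward=torch.rand(8),
    next_state=torch.randn(8, 4),
    done=torch.zeros(8),
)
```

The vanilla-DQN version would instead take `target_model(next_state).max(1)`, which both selects and evaluates with the same network and causes the overestimation the paper discusses.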
Cheers :)