Yes, there are definitely many ways the copying/averaging can be done. In the template, I prefer the "separate Network, hard copying" approach, because I find the separate `Network` easier at first (and also you do not want to serialize it after training), and the hard copying is what DQNs do. (But in later papers we will also see the soft EMA; note that "Polyak" averaging is an arithmetic average, not the exponential moving average you implemented in the second code above.)
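To spell that parenthetical out as a sketch illustrating the difference between the soft EMA update and Polyak averaging (plain numpy on a single parameter array; the function names and the `tau` value are just illustrative):

```python
import numpy as np

def ema_update(target: np.ndarray, theta: np.ndarray, tau: float = 0.005) -> np.ndarray:
    # Soft/EMA target update used in later papers:
    # recent parameters dominate, older ones decay geometrically.
    return (1 - tau) * target + tau * theta

def polyak_average(avg: np.ndarray, theta: np.ndarray, t: int) -> np.ndarray:
    # Polyak(-Ruppert) averaging: the arithmetic mean of all iterates so far,
    # computed incrementally as avg_t = avg_{t-1} + (theta_t - avg_{t-1}) / t.
    return avg + (theta - avg) / t
```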
However, thanks for the simplified code -- that is definitely better!
Regarding the expansion of the state -- that works fine if the state is a numpy array; I am not sure it is promised in the documentation... (reading through the docs) Oh, it is -- great. Then I am merging it and will rely on it (but I will use the syntax `state[np.newaxis]`, which I like better).
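Just to illustrate what the indexing does (a quick sketch, not from the template): indexing with `np.newaxis` adds a leading batch dimension, equivalent to `np.expand_dims(state, 0)`:

```python
import numpy as np

s = np.zeros(4, dtype=np.float32)   # e.g. a single CartPole observation
batched = s[np.newaxis]             # view with shape (1, 4), no data copied
assert batched.shape == (1, 4)
assert np.array_equal(batched, np.expand_dims(s, 0))
```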
Also, just for fun:
```
In [2]: %timeit np.array([s])
603 ns ± 6.18 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [3]: %timeit s[np.newaxis]
168 ns ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [4]: %timeit np.asarray(s[np.newaxis])
226 ns ± 2.72 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
A community-work point is yours, BTW.
Regarding the copying of weights: you probably want to have both the policy and the target network in the same `Network` class, as the target network is needed to calculate the target Q values for computing the loss. So the `copy_weights_from` should rather look like the sketch below, or we could provide a method for doing the soft (Polyak) update right away.
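Something along these lines -- a rough sketch only, assuming the `Network` wraps two `tf.keras` models (the method names and signatures here are illustrative, not the template's actual code):

```python
import tensorflow as tf

class Network:
    def __init__(self, model_fn) -> None:
        # Keep both the trained (policy) model and the target model in one class,
        # so the target Q values are available when the loss is computed.
        self._model = model_fn()
        self._target = model_fn()
        self._target.set_weights(self._model.get_weights())

    def copy_weights_from(self) -> None:
        # Hard copy (classic DQN): overwrite the target weights every C steps.
        self._target.set_weights(self._model.get_weights())

    def update_target_soft(self, tau: float) -> None:
        # Soft ("Polyak"-style) update performed every step:
        # target <- (1 - tau) * target + tau * model.
        self._target.set_weights(
            [(1 - tau) * t + tau * m
             for t, m in zip(self._target.get_weights(), self._model.get_weights())])
```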