Closed: lepmik closed this issue 4 years ago.
Hello,
The `returns` array is initialized on that line and is filled in by the subsequent lines. Each value in that array represents the cumulative discounted sum of rewards from its timestep until the end of the rollout, plus the bootstrapped `next_value`.
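For context, here is a minimal sketch of such a backward pass. It is not the repository's exact code (that lives in `a2c.py`); the variable names `rewards`, `dones`, `next_value`, and `gamma` are assumed to match the blog post.

```python
import numpy as np

def discounted_returns(rewards, dones, next_value, gamma=0.99):
    # Each entry is the discounted sum of rewards from that step onward,
    # bootstrapped with next_value at the end of the rollout.
    returns = np.append(np.zeros_like(rewards, dtype=np.float32), next_value)
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * returns[t + 1] * (1 - dones[t])
    return returns[:-1]  # drop the appended bootstrap value
```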
Hi, sorry, I was fooled by what seemed like a numpy bug in `zeros_like` and jumped to conclusions without properly checking the code. Anyway, here is the background for my confusion: as you can see, using `zeros_like` seems to mess things up.
`np.zeros_like` uses the input array's dtype by default. In the blog post code `rewards` has dtype `np.float32`, but in your case it's `np.int32`. This results in the `gamma * returns[t+1]` part being truncated to zero, effectively making the `returns` array a copy of `rewards`.

To "fix" the issue either use floats or explicitly specify the dtype when defining the rewards array, e.g. `rewards = np.array([0., 0., 1., 1., 0., 1.])`.
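To illustrate the truncation, here is a small standalone snippet (not from the repository) showing the difference between an integer and a float `rewards` array:

```python
import numpy as np

# Integer rewards: zeros_like inherits the integer dtype,
# so the discounted term is truncated on assignment.
rewards = np.array([0, 0, 1, 1, 0, 1])
returns = np.zeros_like(rewards)
returns[-1] = 0.99 * 1.5           # 1.485 is stored as 1
print(returns.dtype, returns[-1])  # e.g. int64 1

# Float rewards: the same assignment keeps its fractional part.
rewards = np.array([0., 0., 1., 1., 0., 1.])
returns = np.zeros_like(rewards)
returns[-1] = 0.99 * 1.5
print(returns.dtype, returns[-1])  # float64 1.485
```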
Ah, my bad entirely then. Thanks for clearing that up.
Hi, first of all thank you for your blog post and nice, readable code! I have been using your example to rewrite some other RL implementation from TF1 to TF2. When I was comparing advantages though, I found some differences. It seems to me that your advantage estimations from https://github.com/inoryy/tensorflow2-deep-reinforcement-learning/blob/6689d0e0a77ca084e30167c75cd48d6f3e26f375/a2c.py#L111 should be