inoryy / tensorflow2-deep-reinforcement-learning

Code accompanying the blog post "Deep Reinforcement Learning with TensorFlow 2.1"
http://inoryy.com/post/tensorflow2-deep-reinforcement-learning/
MIT License

Possible bug #7

Closed: lepmik closed this issue 4 years ago

lepmik commented 4 years ago

Hi, first of all, thank you for your blog post and the nice, readable code! I have been using your example to rewrite another RL implementation from TF1 to TF2. When I compared the advantages, though, I found some differences. It seems to me that the advantage estimation starting at https://github.com/inoryy/tensorflow2-deep-reinforcement-learning/blob/6689d0e0a77ca084e30167c75cd48d6f3e26f375/a2c.py#L111 should be

returns = np.append(values, next_value, axis=-1)
inoryy commented 4 years ago

Hello! The returns array is only initialized on that line; it is calculated over the subsequent lines. Each value in that array represents the cumulative sum of rewards from its timestep until the end, plus the bootstrapped next_value.
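
For reference, here is a minimal, self-contained sketch of that pattern (the dones handling, values, and the example numbers are illustrative, not copied verbatim from a2c.py):

```python
import numpy as np

gamma = 0.99
rewards = np.array([0., 0., 1., 1., 0., 1.], dtype=np.float32)
dones = np.zeros_like(rewards)                 # no episode ended mid-batch
values = np.array([.1, .2, .8, .9, .3, .7], dtype=np.float32)  # made-up critic estimates V(s_t)
next_value = 0.5                               # bootstrap value for the state after the last step

# Initialized here: one zero per step, plus the bootstrap value at the end.
returns = np.append(np.zeros_like(rewards), next_value)

# Filled in here, backwards: each entry is the reward at that step plus the
# discounted return of the next step, cut off at episode boundaries.
for t in reversed(range(rewards.shape[0])):
    returns[t] = rewards[t] + gamma * returns[t + 1] * (1 - dones[t])
returns = returns[:-1]

advantages = returns - values
```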

lepmik commented 4 years ago

Hi, sorry, I was fooled by what seemed like a NumPy bug in zeros_like and jumped to conclusions without properly checking the code. Anyway, here is the background behind my confusion; as you can see, using zeros_like seems to mess things up:

[screenshot demonstrating the zeros_like issue]

inoryy commented 4 years ago

np.zeros_like uses the input array's dtype by default. In the blog post code, rewards has dtype np.float32, but in your case it is np.int32. This results in the gamma * returns[t+1] part being truncated to zero, effectively turning the returns array into a copy of rewards.

To "fix" the issue, either use floats or explicitly specify the dtype when defining the rewards array, e.g. rewards = np.array([0., 0., 1., 1., 0., 1.]).

lepmik commented 4 years ago

Ah, my bad entirely then. Thanks for clearing that up.
