discount_rewards logic not clear

mrahtz / tensorflow-rl-pong

Pong AI trained using policy gradient-based reinforcement learning

discount_rewards logic not clear #1

Open parikshitag opened 5 years ago

parikshitag commented 5 years ago

Hi, I am really stuck on the discount_rewards function. Can you explain the logic behind it? It seems to be updating the rewards in the forward direction.

mrahtz commented 5 years ago

Yeah, the return at time t should be the sum of rewards from time t onwards, with each reward discounted more heavily the further it lies in the future. If this is confusing, check out http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
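Written out, the definition above looks roughly like this. The symbols are my own shorthand rather than anything from the repository: r_t is the reward at time t, \gamma is the discount factor, and T is the index of the last timestep.

```latex
G_t = \sum_{k=0}^{T-t} \gamma^{k} \, r_{t+k}
    = r_t + \gamma \, r_{t+1} + \gamma^{2} \, r_{t+2} + \dots + \gamma^{T-t} \, r_T
```

Pulling r_t out of the sum gives the recursion G_t = r_t + \gamma G_{t+1} (with G_{T+1} = 0), which is why the same returns can be computed either by summing forwards from each t or by accumulating backwards from the end.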

If it still seems strange, take a look at https://github.com/mrahtz/ocd-a3c/blob/cca2c036113e0acf75132521639b1825d1080083/utils.py#L11 which does the same thing but backwards and produces the same result.
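To see that the two directions agree, here is a minimal sketch. It is not the code from either repository: returns_forward and returns_backward are hypothetical names, and neither applies any per-point reset. The forward version sums the discounted rewards directly for every t, while the backward version reuses the return of the following step.

```python
import numpy as np

def returns_forward(rewards, gamma):
    """For each t, directly sum gamma**k * rewards[t + k] (O(n^2))."""
    n = len(rewards)
    out = np.zeros(n, dtype=float)
    for t in range(n):
        for k in range(n - t):
            out[t] += gamma ** k * rewards[t + k]
    return out

def returns_backward(rewards, gamma):
    """Walk from the end, using return[t] = rewards[t] + gamma * return[t + 1] (O(n))."""
    out = np.zeros(len(rewards), dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards = [0.0, 0.0, 1.0]
forward = returns_forward(rewards, 0.5)
backward = returns_backward(rewards, 0.5)
print(forward, backward)  # both hold the values 0.25, 0.5, 1.0
```

The backward pass is usually preferred only because it is linear in the episode length; the values are identical.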

parikshitag commented 5 years ago

Thanks mrahtz! This makes sense, although utils.py also seems to implement the immediate future rewards. Earlier I was updating the rewards backwards, as given in Karpathy's blog:

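import numpy as np

def discount_rewards(rewards, discount_factor):
    # Float output so integer rewards (e.g. +1/-1) are not truncated.
    discounted_rewards = np.zeros_like(rewards, dtype=float)
    running_add = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            # Pong-specific: a non-zero reward ends a point, so reset the sum.
            running_add = 0.0
        running_add = running_add * discount_factor + rewards[t]
        discounted_rewards[t] = running_add
    return discounted_rewards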
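The one thing that separates this version from a plain discounted sum is the reset when rewards[t] != 0. As a rough illustration, here is a self-contained sketch with a hypothetical helper, discounted_returns, whose reset_on_nonzero flag is my own addition for comparison. The reward sequence is made up: a point won at step 2 and a point lost at step 5.

```python
import numpy as np

def discounted_returns(rewards, gamma, reset_on_nonzero):
    """Backward discounted sum; optionally restart the sum at each non-zero reward."""
    out = np.zeros(len(rewards), dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if reset_on_nonzero and rewards[t] != 0:
            running = 0.0  # treat a non-zero reward as the end of a point
        running = running * gamma + rewards[t]
        out[t] = running
    return out

rewards = [0, 0, 1, 0, 0, -1]
with_reset = discounted_returns(rewards, 0.5, reset_on_nonzero=True)
without_reset = discounted_returns(rewards, 0.5, reset_on_nonzero=False)
# with_reset:    0.25, 0.5, 1.0, -0.25, -0.5, -1.0
# without_reset: 0.21875, 0.4375, 0.875, -0.25, -0.5, -1.0
```

With the reset, the steps leading up to the won point are credited only with that +1; without it, the later -1 leaks back and lowers their returns.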