hunkim / ReinforcementZeroToAll


Need to Improve Discounted Reward #1

Closed kkweon closed 7 years ago

kkweon commented 7 years ago

Issue

I came to notice that the current discounted reward function is not summing up the future rewards. I'm not sure if this is intended, but even if it is, the policy gradient will not behave as intended because it will focus on learning only the very first action of each episode.

Recall that the discounted reward at step $t$ is

$$R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t} r_T$$
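
As a quick worked example (using the same rewards as in the docstring below, r = [1, 1, 1] with gamma = 0.99), the discounted returns should be

$$R_0 = 1 + 0.99 + 0.99^2 = 2.9701, \qquad R_1 = 1 + 0.99 = 1.99, \qquad R_2 = 1,$$

whereas the current implementation returns [1, 0.99, 0.9801] before normalization, i.e. it only scales each reward by $\gamma^t$ instead of summing the future rewards.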

Example

Implementation in this repo

import numpy as np


def discount_rewards(r, gamma=0.99):
    """Takes 1d float array of rewards and computes discounted reward
    e.g. f([1, 1, 1], 0.99) -> [1, 0.99, 0.9801] -> [1.22 -0.004 -1.22]
    """
    d_rewards = np.array([val * (gamma ** i) for i, val in enumerate(r)])

    # Normalize/standardize rewards
    d_rewards -= d_rewards.mean()
    d_rewards /= d_rewards.std()
    return d_rewards

Correct Implementation (from Karpathy's code)

def discount_correct_rewards(r, gamma=0.99):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r)
  running_add = 0
  for t in reversed(range(0, r.size)):
    #if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)
    running_add = running_add * gamma + r[t]
    discounted_r[t] = running_add

  discounted_r -= discounted_r.mean()
  discounted_r /- discounted_r.std()
  return discounted_r

With LaTeX

Therefore, each entry of the above function's output should be the discounted sum of the future rewards:

$$R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k = r_t + \gamma R_{t+1}$$

hunkim commented 7 years ago

Good catch. For lab 08, we are experimenting with several reward functions.

Also do you have any comments on:

random_noise = np.random.uniform(0, 1, output_size)
action = np.argmax(action_prob + random_noise)

I guess this is perhaps the correct implementation:

action = np.argmax(np.random.multinomial(n=1, pvals=action_prob, size=1)[0])

I really need help on PG. Please give us more comments. Thanks in advance.

kkweon commented 7 years ago

If it's a policy gradient, the agent should sample actions from the given policy distribution; it shouldn't just take the argmax, at least while it's training.

In the policy gradient agent for the CartPole case, a single action should be chosen as follows:

actions = [0, 1] # suppose there are two discrete actions
action_prob = [0.7, 0.3] # distribution given from policy network
action = np.random.choice(actions, size=1, p=action_prob)

I haven't actually run the file yet, but for problems like CartPole it's almost always the case that a derivative-free method or any simpler model will outperform policy gradient methods. So I wouldn't be surprised if it's actually doing worse. I still have to check whether the other implementations are correct, though. Will let you know!
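
For what it's worth, here is a minimal sketch (not from the repo; it just assumes two actions with a fixed action_prob = [0.7, 0.3]) that compares the three action-selection strategies discussed above by drawing many samples and looking at the empirical action frequencies:

import numpy as np

np.random.seed(0)
action_prob = np.array([0.7, 0.3])  # example policy output for two actions
n_samples = 10000

# 1) uniform noise + argmax (current implementation)
noise_actions = np.array([np.argmax(action_prob + np.random.uniform(0, 1, 2))
                          for _ in range(n_samples)])

# 2) multinomial sampling + argmax
multi_actions = np.array([np.argmax(np.random.multinomial(n=1, pvals=action_prob))
                          for _ in range(n_samples)])

# 3) np.random.choice with p=action_prob
choice_actions = np.random.choice([0, 1], size=n_samples, p=action_prob)

for name, a in [("noise+argmax", noise_actions),
                ("multinomial", multi_actions),
                ("choice", choice_actions)]:
    print(name, np.bincount(a, minlength=2) / n_samples)

With these probabilities, the noise-plus-argmax version selects action 0 roughly 82% of the time instead of 70%, so it does not follow the policy distribution, while the multinomial and np.random.choice versions both do.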

kkweon commented 7 years ago

Today I tested the above code.

It turns out there was a problem with the numpy dtype in the above code: np.zeros_like(r) inherits the dtype of r, which was numpy.int by default, so the discounted values were being truncated to integers.

The correct implementation of discounted rewards should be:

import numpy as np


def discount_correct_rewards(r, gamma=0.99):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r, dtype=np.float32)
    running_add = 0
    for t in reversed(range(len(r))):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add

    # discounted_r -= discounted_r.mean()
    # discounted_r /= discounted_r.std()
    return discounted_r

It works well.

Why does the original implementation still work at all? It's because of the normalization factor, which happens to have a similar effect. That's why people love normalization, I guess lol.

However, it should also work without the normalization: the correct implementation always works, with or without normalization.
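
For reference, here is a quick sanity check of the fixed function on the docstring example (a minimal sketch; it relies on the discount_correct_rewards definition and numpy import from the block above, and the standardization line is just the commented-out part re-enabled):

r = np.array([1, 1, 1])  # integer rewards, the dtype that exposed the original bug
d = discount_correct_rewards(r)
print(d)  # roughly [2.9701, 1.99, 1.0]

# optional: standardize the returns before feeding them to the policy gradient update
d_norm = (d - d.mean()) / d.std()
print(d_norm)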

Suggestion

hunkim commented 7 years ago

Please feel free to fix/send PR.

In addition, could you also fix the max 200-step limit for CartPole in the QN and previous examples?

Thanks in advance!

Androbin commented 6 years ago

Just noticed a fatal typo: discounted_r /- discounted_r.std()

Please update for future readers: discounted_r /= discounted_r.std()