openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.
https://spinningup.openai.com/
MIT License

Simple policy gradient: mean over episodes? #59

Open neighthan opened 5 years ago

neighthan commented 5 years ago

In the simple policy gradient implementation here, all of the observations, actions, and rewards for one "epoch" (potentially consisting of multiple episodes) are gathered into the same lists. These are then all passed into the MLP that parametrizes the policy. Then we have this line:

loss = -tf.reduce_mean(weights_ph * log_probs)

where log_probs is just a 1D vector of the log probabilities of all actions taken in this epoch (regardless of which episode they were from). But on the introduction page, we have this equation:

$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$$

(where $\mathcal{D}$ is the set of collected trajectories).

It seems like the implementation takes the mean over the number of elements in both sums (the total number of actions) instead of just over the number of elements in the first sum (the number of episodes). Am I misunderstanding this, or is it an error? Or is it just implemented like this for simplicity, since scaling the loss by a constant factor only scales the gradient without changing its direction? It's also clearly nicer to store all observations, etc. in single flat vectors, because different episodes will generally have different numbers of steps, so we can't, e.g., build a matrix with one row of actions per episode without padding. But we could, e.g., pass the number of episodes directly into the model; then, even with everything stored in flat vectors, we could still get the result shown in the equation above by summing across the vector and dividing by the number of episodes.
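To spell out the comparison (my notation: $N$ is the total number of steps in the batch, $\mathcal{D}$ is the set of episodes, and $\tau_t$ is the episode that step $t$ belongs to), the implemented loss and the loss implied by the equation are

$$L_{\text{impl}} = -\frac{1}{N} \sum_{t=1}^{N} \log \pi_\theta(a_t \mid s_t)\, R(\tau_t), \qquad L_{\text{eq}} = -\frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t)\, R(\tau) = \frac{N}{|\mathcal{D}|}\, L_{\text{impl}},$$

so for a given batch the two differ only by the factor $N / |\mathcal{D}|$, the average episode length.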

Thanks for your help understanding this!

sun2009ban commented 5 years ago

@neighthan I am also confused about this. According to the equation, I think it should be

loss = - tf.reduce_sum(weights_ph * log_probs) / number_of_trajectories_in_the_epoch

where number_of_trajectories_in_the_epoch can be computed by counting how many times the code enters the if done branch:

    if done:
        number_of_trajectories_in_the_epoch += 1
        # if episode is over, record info about episode
        ep_ret, ep_len = sum(ep_rews), len(ep_rews)
        batch_rets.append(ep_ret)
        batch_lens.append(ep_len)
        ...

I tried this change in the code, but the performance got worse. I hope someone can help me understand this. I'd really appreciate it.

axitkhurana commented 5 years ago

My understanding is:

(since scaling the loss by a constant factor will just scale the gradient but not change the direction)

sounds right: we don't care about the value of the loss itself, only about which parameters minimize it. Parameters that minimize -tf.reduce_mean(weights_ph * log_probs) will also minimize -tf.reduce_sum(weights_ph * log_probs) / number_of_trajectories_in_the_epoch (since the difference is just the constant factor mentioned above).
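A quick numerical sanity check of that proportionality (toy numbers, nothing from the repo):

    import numpy as np

    # The gradient of -mean(w * log_probs) w.r.t. log_probs is -w / n_steps,
    # while the gradient of -sum(w * log_probs) / n_episodes is -w / n_episodes,
    # so the two gradients differ only by a constant factor.
    n_steps, n_episodes = 10, 3               # e.g. episodes of length 4, 3, 3
    weights = np.random.randn(n_steps)        # R(tau), repeated once per step
    grad_mean = -weights / n_steps
    grad_per_episode = -weights / n_episodes
    print(grad_per_episode / grad_mean)       # constant: n_steps / n_episodes

That constant is the average episode length, so switching to the per-episode loss without also shrinking the learning rate effectively scales every update by the average episode length, which could explain the degraded performance mentioned above.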

It would be good to know whether this was chosen for simplicity or for some other reason.

brendenpetersen commented 5 years ago

As others have mentioned, the two methods have the same optima, since they differ only by a constant multiplicative factor.

However, I wanted to point out that many formulations of the policy gradient use an expectation over actions sampled from the current policy and states sampled from the environment (under the current policy). In that case, the mean is over the number of steps, not the number of trajectories. The samples don't necessarily even need to come from complete trajectories at all; e.g., you could downsample from your trajectories to reduce temporal correlations before using them in the gradient estimate.
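For example, that per-step form is often written along the lines of (my notation, with $N$ the total number of sampled steps and $\hat{\Psi}_i$ the return or advantage weight for sample $i$):

$$\nabla_\theta J(\pi_\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, \hat{\Psi}_i.$$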

I think the stranger issue is that the batch is only checked for completion when done is True. Most PG implementations instead define a rollout of a fixed length (measured in number of steps). Whether or not the rollout consists of only a portion of an episode or spans multiple episodes isn't relevant (as long as returns are still tracked according to their respective episode). Under the current implementation, batches can have both a variable number of episodes and a variable number of steps. The number of steps could become extremely large in a setting like MuJoCo with a well-trained agent that generates extremely long episodes.
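A rough sketch of that fixed-length-rollout pattern (placeholder environment and random-action policy, using the old gym step/reset API from around that time; this is not the actual Spinning Up code):

    import gym

    env = gym.make("CartPole-v0")
    policy = lambda obs: env.action_space.sample()   # stand-in for the real policy
    steps_per_batch = 5000                           # fixed rollout length, in steps

    batch_obs, batch_acts, batch_rets, batch_lens = [], [], [], []
    obs, ep_rews = env.reset(), []
    for t in range(steps_per_batch):
        act = policy(obs)
        batch_obs.append(obs)
        batch_acts.append(act)
        obs, rew, done, _ = env.step(act)
        ep_rews.append(rew)
        if done or t == steps_per_batch - 1:
            # returns/lengths are still tracked per episode, even though the
            # batch may span several (possibly truncated) episodes
            batch_rets.append(sum(ep_rews))
            batch_lens.append(len(ep_rews))
            obs, ep_rews = env.reset(), []

The batch is then a constant number of steps regardless of how long individual episodes are.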

For non-vanilla PG methods, this issue actually becomes more complicated, since we are estimating values/advantages. The original PPO paper mentions that the rollout length T should be "much less than the episode length" (apparently quoting the A3C paper, but I don't see this mentioned there?), though I see this violated all the time in practice.

EDIT: Also noticed that the PPO introduction (see Algorithm 1) divides by the number of trajectories multiplied by the trajectory length (fixed in this case, since rollouts are of fixed length), i.e. total number of steps.

EDIT 2: This paper mentions that the number of independent samples from the policy is the number of trajectories (not steps). That doesn't really answer the original question, but it's related to the idea of steps vs episodes vs fixed-length trajectories. Relevant quote from the paper:

"While the total number of samples (state-action pairs) may seem quite large, the number of independent samples used at each iteration corresponds to the number of complete trajectories. This quantity (a) tends to be much lower than the number of state-action pairs, and (b) varies across iterations during training."

wumo commented 5 years ago

Agreed. But for beginners, it's better not to confuse them. Just add:

num_episodes_ph = tf.placeholder(shape=(), dtype=tf.float32)
...
loss = -tf.reduce_sum(weights_ph * log_probs) / num_episodes_ph
...
feed_dict={
    # len(batch_rets) is the number of completed episodes in the batch
    num_episodes_ph: len(batch_rets),
    ...
}