openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms
MIT License

How to draw the score curve of PPO? #768

Open AceChuse opened 5 years ago

AceChuse commented 5 years ago

Hello,

I don't know how to draw the score curve for PPO that appears in the PPO paper. How should I handle the situation where the game is not over but the sample pool is full? In that case, if we end the game, we cannot compute the score, because the agent needs more than Horizon (T) interactions with the environment to finish an episode, for example in Walker2d-v1. But if we don't end the game when the sample pool is full, the number of samples may become larger than Horizon (T).

I don't know how to deal with this. How did you handle it in the PPO experiments? I really care about this. Thanks for your help.

pzhokhov commented 5 years ago

Hi @FrigEgg! Let me repeat the question back to you to make sure I understood it correctly - you mean how to compute the return of the episode if the episode is long (the end of the episode is more than nsteps away)? The short answer is that we use the value function approximation to estimate the return from the future steps that did not fit in the batch. The slightly longer version is that PPO uses what is called generalized advantage estimation (GAE), in which the advantage is estimated from a weighted sum of multi-step return terms (and an n-step return term contains n actual rewards plus the value function estimate n steps ahead). For a more mathematically accurate and less handwavy description, please refer to this paper: https://arxiv.org/pdf/1506.02438.pdf
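
For illustration, here is a minimal sketch of the GAE / bootstrapping computation described above (the function name, argument conventions, and the gamma/lambda defaults are my own assumptions for the sketch, not the exact baselines code):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over a rollout of length T.

    rewards, dones: arrays of shape (T,) collected from the environment,
        where dones[t] == 1 means the episode ended after step t.
    values: value-function estimates V(s_t) for each step, shape (T,).
    last_value: V(s_T) for the state reached after the final step; this is
        what bootstraps the return when the batch cuts an episode short.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        # GAE recursion: exponentially weighted sum of future TD residuals
        last_gae = delta + gamma * lam * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # targets for the value-function loss
    return advantages, returns
```

The `last_value` term is what answers the original question: when the batch ends mid-episode, the missing tail of the return is replaced by the value-function estimate instead of actual rewards.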

As for the score curve of PPO - could you please clarify the question? Normally, we plot the learning curves by recording the length and reward of each episode, then plotting the latter as y and the cumulative sum of the former (so that it becomes the number of environment steps seen by the algorithm) as x (usually with some smoothing added to reduce noise).
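
As a rough illustration of that recipe (this is not the baselines plotting code; the function name, the source of the episode statistics, and the smoothing window are my own choices for the sketch):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_learning_curve(episode_lengths, episode_rewards, window=100):
    """Plot episode reward against cumulative environment steps.

    episode_lengths, episode_rewards: one entry per finished episode,
    e.g. collected from the per-episode logs written during training.
    """
    steps = np.cumsum(episode_lengths)                 # x: total env steps so far
    rewards = np.asarray(episode_rewards, dtype=np.float64)
    # simple moving-average smoothing to reduce noise
    if len(rewards) >= window:
        kernel = np.ones(window) / window
        smoothed = np.convolve(rewards, kernel, mode="valid")
        steps = steps[window - 1:]
    else:
        smoothed = rewards
    plt.plot(steps, smoothed)
    plt.xlabel("environment steps")
    plt.ylabel("episode reward (smoothed)")
    plt.show()
```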

AceChuse commented 5 years ago

I am very sorry, I have not explained my problem clearly enough. What I mean is: doesn't PPO perform an update after N interactions with the environment using the current policy? By N I mean the number of steps per epoch. I want to know how to handle the situation where the steps per epoch is smaller than the number of steps the game needs in order to end.

For example, if the steps per epoch is 256 and the game has already run for 256 steps without ending, what should we do? Following the PPO algorithm, we should update the model now. So should I end the game and reset it? If I end the game, the agent cannot get the samples after step 256. I don't know how to deal with this.

About the score curve: is it usually the score obtained directly during training? Or do you stop training at a checkpoint and run multiple evaluations, such as taking the mean of the top 5 out of 25 evaluations?