openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Is evaluation proper? [concerns] #117

Closed Breakend closed 7 years ago

Breakend commented 7 years ago

Hi,

I have some concerns about the evaluations used throughout this code repository.

PPO/TRPO

The evaluation uses the same trajectories as the training. However, the rollout horizon can cut an episode off mid-trajectory, and if those partial trajectories are included in the evaluation, the results may be skewed for the worse. For example, say we have a batch size of 2048 (the default for MuJoCo) and we collect two trajectories of 1000 timesteps each plus a partial trajectory of 48 timesteps: the average return will look much lower than it actually is because the partial trajectory is counted in the evaluation.
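
Concretely, with made-up numbers (assuming roughly +1 reward per timestep, purely for illustration):

```python
# Toy numbers only: two completed 1000-step episodes plus the truncated
# 48-step segment left over at the 2048-step horizon cut-off.
full_episode_returns = [1000.0, 1000.0]
partial_return = 48.0

# Counting the partial segment as if it were a finished episode:
skewed_mean = (sum(full_episode_returns) + partial_return) / 3   # ~682.7

# Counting only completed episodes:
true_mean = sum(full_episode_returns) / 2                        # 1000.0

print(skewed_mean, true_mean)
```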

[EDIT] Never mind, I realized I missed something. It's a bit confusing, but episode returns only get appended here; partial trajectories can still be returned, but they are only used for training and are not counted in the evaluation. https://github.com/openai/baselines/blob/master/baselines/pposgd/pposgd_simple.py#L56-L58

However, that still leaves the issue that the evaluation is based on an unspecified number of trajectories, and that these are mixed with trajectories from past policies, since the reported number is a running average over the last 100 episodes. This seems unlike the original TRPO evaluation code, but I guess that is up for debate.
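
For reference, here is a minimal sketch of the reporting pattern I'm describing (not the actual pposgd_simple code; the episode lengths and counts are made up):

```python
from collections import deque
import numpy as np

# Episode returns are logged only when an episode finishes inside the segment;
# the reported statistic is a mean over the last 100 completed episodes, which
# inevitably spans several past policy versions.
rng = np.random.default_rng(0)
last_100_ep_returns = deque(maxlen=100)
horizon = 2048

for iteration in range(10):                        # each iteration = one policy update
    steps_left = horizon
    while steps_left > 0:
        ep_len = int(rng.integers(200, 1000))      # made-up episode length
        if ep_len <= steps_left:                   # episode completed inside the segment
            last_100_ep_returns.append(float(ep_len))   # pretend return == length
        # else: trailing partial trajectory -> used for training, never logged
        steps_left -= ep_len

    # <policy update happens here>; the logged window still contains episodes
    # generated by older policies.
    print(iteration, np.mean(last_100_ep_returns))
```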

DDPG @matthiasplappert

First, the algorithm seems a bit off: you collect 100 timesteps, then train for 40 steps, then evaluate for another 100, and loop this 20 times. This has the feel of n-step returns without actually making use of n-step returns. As I understand it, DDPG as originally described should take a small number of environment steps (ideally one) between updates and rely on the replay buffer. Second, it seems you're actually evaluating across at least 10 different policies within a single trajectory.

https://github.com/openai/baselines/blob/master/baselines/ddpg/training.py#L122-L139

This seems very off. I think the proper thing would be to move the evaluation loop outside of the training loop and run N proper full trajectories with a single policy per epoch. And again, as in PPO, the evaluation here may not cover a full trajectory.
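
Something along these lines is what I have in mind (a rough sketch, not the baselines API; `policy` and `eval_env` are placeholders, and the old gym step signature is assumed):

```python
import numpy as np

def evaluate_policy(policy, eval_env, n_episodes=5, max_steps=1000):
    """Run n_episodes full rollouts with a single, frozen policy per epoch."""
    returns = []
    for _ in range(n_episodes):
        obs = eval_env.reset()
        done, total, steps = False, 0.0, 0
        while not done and steps < max_steps:
            action = policy(obs)                    # no exploration noise at eval time
            obs, reward, done, _ = eval_env.step(action)
            total += reward
            steps += 1
        returns.append(total)
    return float(np.mean(returns)), float(np.std(returns))

# Per-epoch structure: run all the collect/train cycles first, then evaluate the
# resulting single policy on full episodes (hypothetical helpers, pseudocode only):
#
# for epoch in range(n_epochs):
#     for cycle in range(20):
#         collect_rollout_steps(100)
#         do_train_steps(40)
#     eval_mean, eval_std = evaluate_policy(policy, eval_env)
```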

Maybe I'm misunderstanding the codebase due to the MPI details, but overall it seems like the evaluations are not proper and can change results either beneficially or detrimentally (even with the running mean over the last 100 returns). Please let me know if I'm incorrect in my assumptions!

ViktorM commented 7 years ago

@Breakend maybe I misunderstood your question, but what's wrong with the rollout and train steps? You can collect experience for n steps and train for k steps, and that's totally fine.

As for evaluation, you can train without any evaluation at all. It doesn't affect training, so how can it change any results?

Breakend commented 7 years ago

Sorry if I wasn't clear. For the rollout and train steps, this was more of an observation about the difference from the original DDPG paper. If you're taking n steps anyway, why not just use n-step backups (asynchronous updates?)? It shouldn't really affect the results, but it's a difference from the original paper, I think. (This is not the main concern.)

The main concern with the DDPG evaluations is this: if people publish a paper using the evaluation as-is, I'm not sure it's proper. Taking rollouts that use a mixture of policies doesn't seem correct and could lead to improper reporting. Let's look at an example.

Case 1:

We take 100 steps and run training iterations on those 100 steps (and on our experience replay). Now the policy knows very well how to handle this 100-step state sequence, since we've just trained on it (and on whatever is in the replay buffer). So the evaluation policy takes these same 100 steps and moves on, since it knows exactly how to handle this state sequence (it has just seen it). This is what currently seems to be implemented.

Case 2:

If, however, we reset our evaluation environment and run 5 full rollouts, we may find that the policy has forgotten how to handle the previous 100 steps, or doesn't yet know how to handle the rest of the trajectory, so it fails.

In Case 1 you've skewed your results in your favour by evaluating in sections instead of evaluating a single policy on a full rollout. You avoid the possible problem of catastrophic forgetting (if it is in fact present). By using a mixture of policies that were each just trained on the sequence they're about to be evaluated on, you're not really evaluating the policy.
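
To put Case 1 in code-ish terms (a toy illustration with made-up cycle counts, not the actual training.py logic):

```python
# One "evaluation episode" of 1000 steps is stitched together from 100-step
# chunks, and the policy is updated between chunks, so up to 10 different
# policy versions contribute to that single reported episode return.
eval_episode_policy_ids = []
policy_version = 0
for cycle in range(10):                    # 10 cycles of (collect 100 / train / eval 100)
    eval_episode_policy_ids += [policy_version] * 100   # eval chunk uses the current policy
    policy_version += 1                                  # training bumps the policy

print(len(eval_episode_policy_ids))        # 1000 evaluation steps...
print(len(set(eval_episode_policy_ids)))   # ...but 10 distinct policy versions
```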

Does that make sense? I'm not sure if I'm right in my assumptions here, but you can see the difference if you look at the rllab implementation of DDPG.

ViktorM commented 7 years ago

Thanks for the more detailed description. I'm travelling and replying from my phone; I'll only be able to look into the rllab implementation details and refresh my knowledge of the baselines DDPG this evening. For now I'll just give a brief comment on 1), and I hope I remember all the implementation details correctly.

So after 100 rollout steps and 20 training steps, the algorithm does not know those 100 rollout steps well. They were just added to the replay buffer, and during training, batches are sampled from it at random. With a replay buffer of even 100K, the chance of drawing a noticeable number of those 100 new transitions when sampling 64x20 transitions is quite small.

Breakend commented 7 years ago

That may be true in later epochs (that it doesn't matter), but in earlier epochs it is certainly a large factor. For the first 3 epochs or so (with the default settings), you have a much greater than random chance of sampling transitions you just saw, simply because the buffer is still small. Perhaps this no longer holds late in training, so it matters less then, but it still seems off.
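
A rough expected-count check of both points, using the numbers mentioned in this thread (100 fresh transitions per cycle, batches of 64, 20 train steps per cycle; the buffer sizes are just assumptions):

```python
fresh = 100            # transitions just added by the latest 100-step rollout
sampled = 64 * 20      # transitions drawn per cycle (batch_size x train steps)

for buffer_size in [100, 2_000, 6_000, 100_000]:
    expected_fresh = sampled * fresh / max(buffer_size, fresh)
    print(f"buffer={buffer_size:>7}: ~{expected_fresh:.1f} of {sampled} sampled "
          f"transitions were just collected")

# With a 100K buffer the fresh data is a drop in the ocean (~1.3 of 1280 samples),
# but in the first few epochs, when the buffer holds only a few hundred or a few
# thousand transitions, the just-collected data dominates the training batches.
```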

Why not just evaluate on full evaluation trajectories? What's happening now is that evaluation is actually done using a mixture of old policies across a single trajectory. Maybe it doesn't affect evaluation results, or maybe it does; either way, it seems off.

joschu commented 7 years ago

It's the right way to evaluate in the sense of total accumulated reward (as in online learning / regret minimization) where you have to balance exploration and exploitation.