ldoshi / rome-wasnt-built-in-a-day

Define & Implement Stopping Criteria for Training #23

arvindthiagarajan commented 3 years ago

We currently compute the TD error during training, along with a loss term based on it, and both get logged (to TrainingHistory and TensorBoard, respectively).

We need some criteria for when our model's performance is good enough, criteria that we can then use to determine when to stop training.

These criteria should roughly capture both the model's performance and the expected marginal improvement (after another iteration of training) under the current model parameters. The criteria should define an upper bound for the former and a lower bound for the latter, beyond which training will stop. Whether both conditions must be satisfied, or just one, is something we should probably experiment with.
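For concreteness, here is a minimal sketch of what such a combined rule could look like. The function name, thresholds, and the higher-is-better convention are placeholders rather than a committed design:

```python
def should_stop(perf, prev_perf, perf_target, min_improvement, require_both=True):
    """Hypothetical combined stopping rule (all names and thresholds are placeholders).

    perf: current performance metric, where higher is better (e.g. average reward).
    prev_perf: the same metric from the previous training iteration.
    perf_target: performance level we consider "good enough".
    min_improvement: marginal improvement below which we consider training stalled.
    """
    good_enough = perf >= perf_target
    plateaued = (perf - prev_perf) <= min_improvement
    return (good_enough and plateaued) if require_both else (good_enough or plateaued)
```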

The samples (memories) we use to compute the relevant statistics may be drawn in a number of different ways.

Part of this task is deciding on a good strategy for this (which statistics we care about, and the way in which we sample memories and compute the statistics on them) and writing up your rationale for that strategy. Ideally, all contributors should sign off on this.

The two core quantities that we ultimately care about are the TD Error and the expected rewards we get from using our model. That's probably a good thing to keep in mind as you work on this.

ldoshi commented 3 years ago

Notes from discussion:

Concern: Any approach based on a limited set of states (e.g., popular or recently visited states) risks having gaps because the number of environments and states is very large. We may find and reduce errors for a basket of states while no environment can actually be solved. We are worried about overestimating the quality of the model.

Guiding Idea for Metric: A single bad performance on one environment is more acceptable than a bad performance on one brick across all environments.

Current Plan:

We can potentially set a threshold for when a validation run passes or fails. Approaches like patience for early stopping can be applied to this metric, either on the pass/fail criterion or directly on the metric itself.

Lightning has hooks for validation functions and early stopping that can be used to implement this.

The metric feels like it could work. If we can provide any basis for why this metric is good, that would be a plus.
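For reference, a minimal sketch of wiring this up with Lightning's early-stopping callback. The metric name `val_pass_rate` and the parameter values are placeholders; it assumes we log the metric from a validation hook via `self.log(...)`:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

# Stop training once the monitored validation metric has not improved for
# `patience` consecutive validation checks.
early_stop = EarlyStopping(
    monitor="val_pass_rate",  # placeholder name, logged in a validation hook
    mode="max",               # a higher pass rate is better
    patience=5,               # validation checks to wait without improvement
    min_delta=0.0,            # minimum change that counts as an improvement
)

trainer = pl.Trainer(callbacks=[early_stop], max_epochs=1000)
# trainer.fit(model, datamodule=dm)
```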

ldoshi commented 3 years ago

After consulting with some experienced people, I'm planning to implement the following. That should get us going, and then we can experiment and tweak as we go.

The metric we actually care about is the episode reward, not the loss. It's relatively cheap for us to generate new rewards. At each validation step, we can run n full episodes with the current policy, so we work with only on-policy data, and compute the average reward across the n episodes.

We can do this at each validation step and see how the reward changes. Apparently the reward may or may not level out cleanly. Let's see, and we can tweak as needed. I was told that sometimes you actually eyeball a chart, realize you peaked m iterations ago, and use the model from that iteration as your final policy.
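For concreteness, a rough sketch of what the per-validation computation could look like. The Gym-style `reset`/`step` interface and the `policy(state)` callable are assumptions about our setup, not our actual interfaces:

```python
def average_episode_reward(env, policy, n_episodes=20, max_steps=1000):
    """Run n full on-policy episodes and return the mean total episode reward.

    Assumes a Gym-style env (reset/step returning (state, reward, done, info))
    and a greedy policy(state) callable; both are placeholders.
    """
    totals = []
    for _ in range(n_episodes):
        state = env.reset()
        total, done, steps = 0.0, False, 0
        while not done and steps < max_steps:
            action = policy(state)
            state, reward, done, _ = env.step(action)
            total += reward
            steps += 1
        totals.append(total)
    return sum(totals) / len(totals)
```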

Catastrophic forgetting may come up too -- I'm creating a ticket about that.

arvindthiagarajan commented 3 years ago

I guess two questions:

  1. If we have good reward performance but our TD errors are large, are we comfortable stopping? It sounds like this is saying we are. Naively that feels off, though the most I could say is that it would make our model brittle to new starting states.
  2. How do you pick your threshold reward here? Or are you saying it's a (discrete) derivative of the reward over time that you care about?

ldoshi commented 3 years ago

  1. I was told the TD error is not a super great metric of success -- in some cases the values can be very close between choices, so training until the best action is "TD error"-dominant over the others may be impractical or effectively impossible. Also, the TD error is not, in the end, what we actually want the policy to do well on. That will be reward/bridge score (depending on how we handle 'creativity', etc). This seems different in RL vs SL. Re: brittle -- let's see. In theory, our n episodes span a breadth of starting states, so we'll capture some of this if the TD error is effectively confusing the policy. We will almost surely have to iterate on this though.

  2. Threshold reward to stop at? I had asked if we could wait until the reward levels out -- i.e., the policy isn't getting better (even if it might not be a good policy because the model hyperparameters/architecture are inferior for this run). It sounded like the reward may bounce around a bit or even decline again, so in practice there may be a manual (!!!) step of looking at these reward curves and choosing which model policy (as a function of time) was doing the best. This sounded a little hand-wavy, but it came from someone who does this as their job. I'm hoping we can build some refinement here once we have a base impl in place; one low-effort option is sketched below.
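One such refinement (a sketch under assumed names, not a committed design): have Lightning keep the top-k checkpoints ranked by the logged validation reward, so that "the policy from m iterations ago" is already saved when we eyeball the curve. The `val_reward` metric name is a placeholder:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep the k best checkpoints ranked by the logged validation reward, so the
# best policy seen so far can be recovered even if later training degrades.
checkpoint_cb = ModelCheckpoint(
    monitor="val_reward",  # placeholder metric name logged during validation
    mode="max",
    save_top_k=3,
    filename="policy-{step}-{val_reward:.2f}",
)
```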

ldoshi commented 3 years ago

I'll also transcribe more of the discussion here later so we can use it as a basis for additional discussion.

ldoshi commented 3 years ago

I should be able to push a PR for review and discussion as soon as I can add some testing.

One thing I noticed is that if a val_reward gets (un)lucky and comes out a bit high (where higher is better), that will effectively start the early-stopping timer, because the next several val_rewards will be lower than that blip even though training hasn't actually converged yet.

It's plausible that larger validation batch and patience values will provide some protection here. I propose we iterate and revisit once we have a better handle on the model; this may not actually be a problem in practice once the relevant parameters have reasonable values.
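If the blip does turn out to matter, a hypothetical first tweak (values are placeholders, nothing here is validated): average the reward over more validation episodes so a single lucky run moves the metric less, and raise the patience so a single high reading can't end training on its own.

```python
from pytorch_lightning.callbacks import EarlyStopping

# Hypothetical knobs: more episodes per validation pass (lower-variance
# val_reward) and a larger patience before early stopping triggers.
N_VALIDATION_EPISODES = 50  # placeholder; averages out single lucky episodes
early_stop = EarlyStopping(
    monitor="val_reward",  # placeholder metric name
    mode="max",
    patience=10,           # more validation checks tolerated after a lucky peak
)
```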