vlad17 / mve

MVE: model-based value estimation
Apache License 2.0

Add reward scaling / popart #388

Open vlad17 opened 6 years ago

vlad17 commented 6 years ago

Popart is a method that lets a network cope with rewards of widely varying magnitude: it rescales the weights of the value output layer whenever the reward magnitude observed in the environment changes.

See this paper for details.
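
For reference, here is the output-preserving rescaling as I understand it, written for a linear final layer. The notation is mine and may not match the paper's: $h(x)$ is the last hidden activation, $W, b$ are the output weights and bias, and $(\mu, \sigma)$ are the reward (target) mean and standard deviation before the statistics update, $(\mu', \sigma')$ after.

$$
\begin{aligned}
f(x) &= \sigma \, (W h(x) + b) + \mu \\
W' &= \frac{\sigma}{\sigma'} \, W, \qquad b' = \frac{\sigma b + \mu - \mu'}{\sigma'} \\
\sigma' \, (W' h(x) + b') + \mu' &= \sigma W h(x) + \sigma b + \mu - \mu' + \mu' = f(x)
\end{aligned}
$$

So the unnormalized value estimate is unchanged by the update, while the network's raw output now lives on the new normalized scale.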

Adding popart is a bit nuanced, because it requires a moderately involved implementation. I recommend the following steps:

  1. Create a RewardRenormalizable interface class in memory/ which exposes an adjust_scale method. It's up to subclasses to implement this in a semantically meaningful way; for a value network that is a RewardRenormalizable instance, see the paper's examples on how to rescale the final output layer of a neural network.
  2. Add a method to the existing memory.normalization.Normalizer class, e.g. subscribe_to_reward_rescale, which adds the caller (who passes a reference to itself) to an internal list of subscribers.
  3. Upon statistics updates to the Normalizer (which should probably be online), invoke the adjust_scale method of every subscriber in the internal list with the appropriate statistics.
  4. Make flags available for activating popart.
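
A rough sketch of steps 1 to 3, just to make the shape concrete. Everything beyond the names RewardRenormalizable, adjust_scale, and subscribe_to_reward_rescale is an assumption for illustration: the Normalizer below is a toy stand-in and not the existing memory.normalization.Normalizer, and LinearValueHead, update, and the (new_mean, new_std) signature are hypothetical. It uses NumPy rather than the repo's actual network code.

```python
from abc import ABC, abstractmethod

import numpy as np


class RewardRenormalizable(ABC):
    """Step 1: anything whose outputs depend on the reward scale."""

    @abstractmethod
    def adjust_scale(self, new_mean, new_std):
        """React to a change in the reward statistics."""


class Normalizer:
    """Toy stand-in (assumption) that tracks online reward statistics."""

    def __init__(self, eps=1e-6):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations (Welford)
        self._eps = eps
        self._subscribers = []

    @property
    def std(self):
        # Fall back to unit scale until at least two samples are seen.
        if self.count < 2:
            return 1.0
        return max(float(np.sqrt(self._m2 / self.count)), self._eps)

    def subscribe_to_reward_rescale(self, subscriber):
        """Step 2: the caller passes a reference to itself."""
        self._subscribers.append(subscriber)

    def update(self, rewards):
        """Step 3: update statistics online, then notify subscribers."""
        for r in np.asarray(rewards, dtype=float).ravel():
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self._m2 += delta * (r - self.mean)
        for subscriber in self._subscribers:
            subscriber.adjust_scale(self.mean, self.std)


class LinearValueHead(RewardRenormalizable):
    """Hypothetical final layer of a value network."""

    def __init__(self, weights, bias):
        self.weights = np.asarray(weights, dtype=float)
        self.bias = float(bias)
        self.mean, self.std = 0.0, 1.0  # scale the head currently assumes

    def value(self, features):
        # Unnormalized value estimate: std * (normalized output) + mean.
        return self.std * (features @ self.weights + self.bias) + self.mean

    def adjust_scale(self, new_mean, new_std):
        # Rescale so that value() is unchanged under the new statistics.
        self.weights = self.weights * (self.std / new_std)
        self.bias = (self.std * self.bias + self.mean - new_mean) / new_std
        self.mean, self.std = new_mean, new_std


rng = np.random.default_rng(0)
head = LinearValueHead(rng.normal(size=4), 0.5)
normalizer = Normalizer()
normalizer.subscribe_to_reward_rescale(head)

features = rng.normal(size=(3, 4))
before = head.value(features)
normalizer.update([100.0, 250.0, -40.0])
after = head.value(features)
print(np.allclose(before, after))
```

The point of the subscription list is that the Normalizer doesn't need to know what a subscriber is; a value network, a target network, or anything else that depends on the reward scale can register itself. Step 4 would then just gate the subscribe_to_reward_rescale call behind whatever flag we add.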