Closed: 0xangelo closed this issue 5 years ago.
If `base_env` in the callback solution is an already-instantiated env object, then the callback method would be the simpler alternative, I guess... but if it is just a reference to the env class, then the second method might be simpler.
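To illustrate the instance-vs-class distinction above, here is a rough, self-contained sketch. `ToyEnv`, `ToyPolicy`, and `set_env_from_callback` are hypothetical stand-ins for illustration only, not actual RLlib or MAPO code:

```python
import inspect

class ToyEnv:
    """Stand-in env; a real one would be a gym.Env subclass."""
    def __init__(self, env_config=None):
        self.env_config = env_config or {}

class ToyPolicy:
    def __init__(self):
        self.env = None

def set_env_from_callback(policy, base_env, env_config=None):
    # Hypothetical helper: attach the object directly if it is already
    # instantiated; otherwise build an instance from the class reference.
    if inspect.isclass(base_env):
        policy.env = base_env(env_config)   # only a class: instantiate ourselves
    else:
        policy.env = base_env               # live instance: just attach it

# Already-instantiated env: the callback route is a simple attach.
live = ToyEnv({"size": 2})
p1 = ToyPolicy()
set_env_from_callback(p1, live)
assert p1.env is live

# Class reference: we must construct our own instance.
p2 = ToyPolicy()
set_env_from_callback(p2, ToyEnv, {"size": 2})
assert isinstance(p2.env, ToyEnv) and p2.env is not live
```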
For now, I can think of two ways:

1. Add a policy method (e.g. `set_current_env`) to set the environment the policy is using, and call it from a callback. Callbacks are passed through the config (`config["callbacks"]["on_episode_start"]`), but we can add them in `MAPOTrainer` so that they're added every time (DQN does this). The callback is called with the following arguments: `config["env"]` and `config["env_config"]`.
2. Have the policy instantiate the environment itself. This can be problematic if the environment has some hidden internal state, since in that case the instance used for calculating transitions and the one used for training might behave differently. Nevertheless, we would probably do something similar to what `Trainer` does:
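(The actual `Trainer` snippet isn't reproduced here.) A minimal, self-contained sketch of how the two options might fit together — `MAPOPolicy`, `FakeEnv`, and `make_env_from_config` are hypothetical stand-ins; only `set_current_env`, the config keys, and the callback hook come from the discussion above:

```python
class FakeEnv:
    """Stand-in for a gym-style environment class."""
    def __init__(self, env_config=None):
        self.env_config = env_config or {}

class MAPOPolicy:
    def __init__(self, config):
        self.config = config
        self.env = None

    def set_current_env(self, env):
        # Way 1: the callback hands us the already-instantiated rollout env.
        self.env = env

    def make_env_from_config(self):
        # Way 2: build our own copy from config["env"] / config["env_config"].
        # Risky if the env has hidden internal state: this copy can drift
        # from the one generating the rollouts.
        env_cls = self.config["env"]
        self.env = env_cls(self.config["env_config"])
        return self.env

def on_episode_start(info):
    # Callback in the style of config["callbacks"]["on_episode_start"]:
    # push the rollout env into the policy before the episode begins.
    info["policy"].set_current_env(info["env"])

# Way 1: wiring the callback through the config (what MAPOTrainer
# would do automatically, the way DQN adds its own callbacks).
config = {
    "env": FakeEnv,
    "env_config": {"size": 4},
    "callbacks": {"on_episode_start": on_episode_start},
}
policy = MAPOPolicy(config)
rollout_env = FakeEnv(config["env_config"])
config["callbacks"]["on_episode_start"]({"policy": policy, "env": rollout_env})
assert policy.env is rollout_env  # same instance, so no state divergence

# Way 2: the policy builds its own instance; note it is a *different* object,
# which is exactly where hidden internal state becomes a problem.
policy2 = MAPOPolicy(config)
own_env = policy2.make_env_from_config()
assert isinstance(own_env, FakeEnv) and own_env is not rollout_env
```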