ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Reporting Reward Breakdowns #7518

Open gauravg11 opened 4 years ago

gauravg11 commented 4 years ago

I have a custom environment where the total reward is the sum of intrinsic reward and environmental reward.

I've configured the environment to emit the reward breakdowns as: info = {'agent0' : {'intrinsic' : X, 'environmental' : Y} ... }
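For context, a minimal sketch of how a MultiAgentEnv step() might emit this per-agent breakdown (the reward helpers and observation code here are placeholders, not my actual environment):

def step(self, action_dict):
    obs, rewards, dones, infos = {}, {}, {}, {}
    for agent_id, action in action_dict.items():
        intrinsic = self._intrinsic_reward(agent_id, action)    # placeholder
        environmental = self._env_reward(agent_id, action)      # placeholder
        rewards[agent_id] = intrinsic + environmental
        # Emit the breakdown so the callback can recover it later.
        infos[agent_id] = {'intrinsic': intrinsic, 'environmental': environmental}
        obs[agent_id] = self._get_obs(agent_id)                 # placeholder
        dones[agent_id] = False
    dones['__all__'] = False
    return obs, rewards, dones, infos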

I then define a custom callback as below:

def on_postprocess_traj(info):
    # Legacy callbacks API: `info` contains the postprocessed per-agent
    # sample batch ("post_batch") and the episode object ("episode").
    infos = info['post_batch']['infos']
    intrinsic_vals = [f['intrinsic'] for f in infos]
    env_vals = [f['environmental'] for f in infos]

    episode = info['episode']
    episode.custom_metrics["envs"] = sum(env_vals)
    episode.custom_metrics["intrinsics"] = sum(intrinsic_vals)

However, I'm struggling to get my custom metrics to line up with built-in metrics such as episode_reward_mean. What would be the right way to record reward breakdowns here?

ericl commented 4 years ago

This might be because some episodes aren't finished when on_postprocess_traj is called, so you are computing rewards on partial episodes instead. Try setting the "batch_mode": "complete_episodes" config, which will force complete trajectories to be generated.
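For example (a minimal sketch; the rest of the config is omitted):

config = {
    # Force rollouts to contain only whole episodes, so per-episode sums in
    # on_postprocess_traj see the full trajectory rather than a fragment.
    "batch_mode": "complete_episodes",
    # The default, "truncate_episodes", can split an episode across batches.
}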

gauravg11 commented 4 years ago

While this definitely had a large effect in bringing the metrics closer together, there's still a discrepancy of roughly an order of magnitude.

Result for A3C_cleanup_env_0:
  custom_metrics:
    totals_max: -155
    totals_mean: -948.6603773584906
    totals_min: -1787
  date: 2020-03-12_17-45-43
  done: false
  episode_len_mean: 1000.0
  episode_reward_max: -1882.0
  episode_reward_mean: -5024.641509433963
  episode_reward_min: -8642.0
  episodes_this_iter: 106
  episodes_total: 106
...
  timesteps_since_restore: 100000
  timesteps_this_iter: 100000
  timesteps_total: 100000
  training_iteration: 1