ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Inaccurate samples from on_sample_end() #4809

Closed rl-2 closed 5 years ago

rl-2 commented 5 years ago

System information

Describe the problem

Algorithm: APEX_DDPG Environment: Pendulum-v0

I'm trying to use on_sample_end() to retrieve all the transition data: [obs, action, reward, obs_next, done]. Within each episode, the dones should all be False except for the last transition; that is, each episode should contain exactly one True. However, I've noticed that the number of Trues actually depends on n_step. Specifically, when batch_mode is "complete_episodes", the number of Trues at the end of each episode equals the value of n_step. When batch_mode is "truncate_episodes", the number of Trues varies between 0 and the value of n_step.

Source code / logs

The code to see the number of Trues:

def on_sample_end(info):
    samples = info["samples"]
    dones = samples.columns(["dones"])
    # Count how many transitions in this batch are flagged done=True.
    count_true = sum(1 for d in dones[0] if d)
    print(count_true)
ericl commented 5 years ago

This is to be expected given how the n_step postprocessing works. At the end of the trajectory, n is truncated to fit within the trajectory length, so you may actually see multiple steps with done=True, but with rewards summed over n, n-1, n-2, etc. steps instead of n.
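To make the truncation concrete, here is a minimal sketch of n-step postprocessing (a hypothetical stand-in for RLlib's internal routine, not its actual code): each transition's reward becomes the discounted sum of up to n future rewards, truncated at episode end, and any transition whose n-step window reaches the terminal step is marked done=True. The result is that the last n transitions of an episode all carry done=True.

```python
def nstep_postprocess(rewards, dones, n, gamma=0.99):
    """Hypothetical sketch of n-step postprocessing, not RLlib's actual
    function: fold up to n future rewards into each transition and
    propagate the terminal flag back over the n-step window."""
    T = len(rewards)
    new_rewards = [0.0] * T
    new_dones = [False] * T
    for t in range(T):
        r, d = 0.0, False
        for k in range(n):
            if t + k >= T:
                break  # window truncated by the end of the trajectory
            r += (gamma ** k) * rewards[t + k]
            if dones[t + k]:
                d = True  # window reaches the terminal step
                break
        new_rewards[t], new_dones[t] = r, d
    return new_rewards, new_dones

# A 5-step episode with a single terminal transition:
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
dones = [False, False, False, False, True]
_, new_dones = nstep_postprocess(rewards, dones, n=3, gamma=1.0)
print(new_dones)  # [False, False, True, True, True] -- the last n=3 are True
```

This matches the behavior reported above: with complete_episodes, exactly n_step transitions per episode end up with done=True.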

If you need to access the data prior to postprocessing, we could add a callback that runs on each trajectory fragment prior to postprocessing?

rl-2 commented 5 years ago

Would this also affect other parts of the transitions, such as rewards and obs? If so, it would be great to have a callback prior to post-processing, as the accuracy of those transitions is crucial for some of our studies. Thanks!

rl-2 commented 5 years ago

It also looks like samples.count returns an inaccurate number under truncate_episodes mode: 50 vs. 4001 in my case, where 4001 is the maximum number of steps in one episode of my custom environment.

ericl commented 5 years ago

It's important to note that batches do not correspond to episodes. In general, a batch can consist of one or more episode fragments produced by the sampler, so this behaviour is expected.
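As a quick illustration (plain Python, not RLlib's actual sampler code; the fragment size of 50 is an assumption matching the reported count): chopping one long episode into fixed-size rollout fragments, as truncate_episodes does, means each batch's count reflects the fragment size, not the episode length.

```python
# Simulate truncate_episodes: one long episode is split into
# fixed-size rollout fragments, and each batch holds one fragment.
episode_len = 4001   # the environment's max episode length from the report
fragment_size = 50   # assumed rollout fragment size

transitions = list(range(episode_len))
fragments = [transitions[i:i + fragment_size]
             for i in range(0, episode_len, fragment_size)]

print(len(fragments[0]))               # 50   -- what samples.count reports per batch
print(sum(len(f) for f in fragments))  # 4001 -- the full episode, only across batches
```

So a per-batch count of 50 is consistent with the episode being delivered across many batches rather than in one.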


ericl commented 5 years ago

@RodgerLuo can you try out https://github.com/ray-project/ray/pull/4871 ?