sail-sg / envpool

C++-based high-performance parallel environment execution engine (vectorized env) for general RL environments.
https://envpool.readthedocs.io
Apache License 2.0

[BUG] `done=True`, but `infos['TimeLimit.truncated']` and `infos["terminated"]` are `False` #239

Closed vwxyzjn closed 1 year ago

vwxyzjn commented 1 year ago

Describe the bug

It is possible for envpool to return done=True, but infos['TimeLimit.truncated'] and infos["terminated"] both return False.

To Reproduce

See the following snippet and download the model. The code is roughly:

import envpool
import numpy as np

envs = envpool.make(
    "UpNDown-v5",
    env_type="gym",
    num_envs=1,
    episodic_life=True,  # Espeholt et al., 2018, Tab. G.1
    repeat_action_probability=0,  # Hessel et al., 2022 (Muesli) Tab. 10
    noop_max=30,  # Espeholt et al., 2018, Tab. C.1 "Up to 30 no-ops at the beginning of each episode."
    full_action_space=False,  # Espeholt et al., 2018, Appendix G., "Following related work, experts use game-specific action sets."
    max_episode_steps=int(108000 / 4),  # Hessel et al. 2018 (Rainbow DQN), Table 3, Max frames per episode
    reward_clip=True,
    seed=1,
)
next_obs = envs.reset()
step = 0
done = False
episodic_return = 0.0
episodic_returns = []  # filled in by the full evaluation script
while not done:
    step += 1
    actions, key = get_action_and_value(network_params, actor_params, next_obs, key)  # policy from the downloaded model
    next_obs, _, d, infos = envs.step(np.array(actions))
    episodic_return += infos["reward"][0]
    done = sum(infos["terminated"]) + sum(infos["TimeLimit.truncated"]) >= 1
    print(step, infos['TimeLimit.truncated'], infos["terminated"], infos["elapsed_step"], d)

    if step > int(108000 / 4) + 10:
        print("the environment is supposed to be done by now")
        break
print(f"eval_episode={len(episodic_returns)}, episodic_return={episodic_return}")
27000 [False] [0] [26997] [False]
27001 [False] [0] [26998] [False]
27002 [False] [0] [26999] [False]
27003 [False] [0] [27000] [ True]
27004 [False] [0] [0] [False]
27005 [False] [0] [1] [False]
27006 [False] [0] [2] [False]
27007 [False] [0] [3] [False]
27008 [False] [0] [4] [False]
27009 [False] [0] [5] [False]
27010 [False] [0] [6] [False]
27011 [False] [0] [7] [False]
the environment is supposed to be done by now
eval_episode=0, episodic_return=367860.0

As shown, done=True at step 27003 (elapsed_step=27000), but the info dict reports that the environment is neither terminated nor truncated. This is a bug because the environment is in fact truncated by the time limit.

Expected behavior

infos['TimeLimit.truncated'] should be True at the 27000th elapsed step.
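
For reference, a minimal sketch of the time-limit bookkeeping I would expect (purely illustrative, not envpool's actual code; elapsed_step, max_episode_steps, and game_over are the quantities that show up in the debug logs later in this thread):

def expected_flags(elapsed_step, max_episode_steps, game_over):
    # Truncation: the step budget ran out, but the game itself did not end.
    truncated = elapsed_step >= max_episode_steps and not game_over
    # Termination: the game actually ended.
    terminated = game_over
    done = terminated or truncated
    return done, terminated, truncated

# At elapsed_step=27000, max_episode_steps=27000, game_over=False this gives
# done=True, terminated=False, truncated=True.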


System info

Describe the characteristic of your environment:

import envpool, numpy, sys
print(envpool.__version__, numpy.__version__, sys.version, sys.platform)
0.8.1 1.21.6 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] linux

Additional context

Might be related to #179

Markus28 commented 1 year ago

I will look into it

Trinkle23897 commented 1 year ago

running into this:

26996 [False] [0] [26991] [False]
I20230326 05:36:29.981667 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26992; max_episode_steps=27000
26997 [False] [0] [26992] [False]
I20230326 05:36:29.990259 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26993; max_episode_steps=27000
26998 [False] [0] [26993] [False]
I20230326 05:36:29.999145 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26994; max_episode_steps=27000
26999 [False] [0] [26994] [False]
I20230326 05:36:30.008062 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26995; max_episode_steps=27000
27000 [False] [0] [26995] [False]
I20230326 05:36:30.016428 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26996; max_episode_steps=27000
27001 [False] [0] [26996] [False]
I20230326 05:36:30.024385 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26997; max_episode_steps=27000
27002 [False] [0] [26997] [False]
I20230326 05:36:30.033180 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26998; max_episode_steps=27000
27003 [False] [0] [26998] [False]
I20230326 05:36:30.042129 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26999; max_episode_steps=27000
27004 [False] [0] [26999] [False]
I20230326 05:36:30.052059 47267 atari_env.h:234] discount=0; lives=7; game_over=0; done=1; elapsed_step=27000; max_episode_steps=27000
27005 [False] [0] [27000] [ True]
I20230326 05:36:30.091991 47267 atari_env.h:234] discount=1; lives=5; game_over=0; done=0; elapsed_step=0; max_episode_steps=27000
27006 [False] [0] [0] [False]
I20230326 05:36:30.101820 47267 atari_env.h:234] discount=1; lives=5; game_over=0; done=0; elapsed_step=1; max_episode_steps=27000
27007 [False] [0] [1] [False]
I20230326 05:36:30.110059 47267 atari_env.h:234] discount=1; lives=5; game_over=0; done=0; elapsed_step=2; max_episode_steps=27000
27008 [False] [0] [2] [False]
I20230326 05:36:30.117316 47267 atari_env.h:234] discount=1; lives=5; game_over=0; done=0; elapsed_step=3; max_episode_steps=27000
27009 [False] [0] [3] [False]
I20230326 05:36:30.125382 47267 atari_env.h:234] discount=1; lives=5; game_over=0; done=0; elapsed_step=4; max_episode_steps=27000
27010 [False] [0] [4] [False]
I20230326 05:36:30.133920 47267 atari_env.h:234] discount=1; lives=5; game_over=0; done=0; elapsed_step=5; max_episode_steps=27000
27011 [False] [0] [5] [False]
the environment is supposed to be done by now
eval_episode=0, episodic_return=359780.0
Trinkle23897 commented 1 year ago

I20230326 06:10:13.587806 71402 atari_env.h:234] discount=1; lives=8; game_over=0; done=0; elapsed_step=3838; max_episode_steps=27000
I20230326 06:10:13.587837 71402 env.h:193] in Allocate: current_step: 3838 done: 0 max_episode_steps: 27000
3838 [False] [0] [3838] [False]
I20230326 06:10:13.594794 71402 atari_env.h:234] discount=0; lives=7; game_over=0; done=1; elapsed_step=3839; max_episode_steps=27000
I20230326 06:10:13.594839 71402 env.h:193] in Allocate: current_step: 3839 done: 1 max_episode_steps: 27000
3839 [False] [0] [3839] [ True]
I20230326 06:10:13.608574 71402 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=3839; max_episode_steps=27000
I20230326 06:10:13.608611 71402 env.h:193] in Allocate: current_step: 0 done: 0 max_episode_steps: 27000
3840 [False] [0] [3839] [False]
I20230326 06:10:13.616739 71402 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=3840; max_episode_steps=27000
I20230326 06:10:13.616784 71402 env.h:193] in Allocate: current_step: 1 done: 0 max_episode_steps: 27000
3841 [False] [0] [3840] [False]
I20230326 06:10:13.624781 71402 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=3841; max_episode_steps=27000
I20230326 06:10:13.624830 71402 env.h:193] in Allocate: current_step: 2 done: 0 max_episode_steps: 27000
3842 [False] [0] [3841] [False]
Trinkle23897 commented 1 year ago

I know the issue: when episodic_life is set, the step counter in env.h (the base class) doesn't track the real episode step; it tracks the per-life episode step instead.
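
Roughly speaking (a hypothetical Python sketch of the mismatch, not the actual C++ in env.h; the names are made up), the base-class counter restarts on every per-life "done", so it lags behind the real elapsed_step and never triggers the time limit:

class BaseEnvCounter:
    """Illustration only: how a per-life counter misses the time limit."""

    def __init__(self, max_episode_steps):
        self.max_episode_steps = max_episode_steps
        self.current_step = 0  # counter used to decide truncation

    def step(self, life_lost):
        self.current_step += 1
        truncated = self.current_step >= self.max_episode_steps
        if life_lost:
            # With episodic_life=True a life loss looks like an episode end,
            # so this counter resets while the game's real elapsed_step keeps growing.
            self.current_step = 0
        return truncated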

Trinkle23897 commented 1 year ago

after fix:

3837 [False] [0] [3837] [False]
I20230326 06:20:50.503247 81434 atari_env.h:234] discount=1; lives=8; game_over=0; done=0; elapsed_step=3838; max_episode_steps=27000
I20230326 06:20:50.503301 81434 env.h:193] in Allocate: current_step: 3838 done: 0 max_episode_steps: 27000
3838 [False] [0] [3838] [False]
I20230326 06:20:50.512998 81434 atari_env.h:234] discount=0; lives=7; game_over=0; done=1; elapsed_step=3839; max_episode_steps=27000
I20230326 06:20:50.513052 81434 env.h:193] in Allocate: current_step: 3839 done: 1 max_episode_steps: 27000
3839 [False] [0] [3839] [ True]
I20230326 06:20:50.530524 81434 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=3839; max_episode_steps=27000
I20230326 06:20:50.530570 81434 env.h:193] in Allocate: current_step: 0 done: 0 max_episode_steps: 27000
3840 [False] [0] [3839] [False]

This still holds, because it loses one life (8 to 7), but it's neither terminated nor truncated.

@vwxyzjn should I set terminated=True in the case above?

27002 [False] [0] [26997] [False]
I20230326 06:27:29.944105  2613 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26998; max_episode_steps=27000
I20230326 06:27:29.944160  2613 env.h:193] in Allocate: current_step: 1210 done: 0 max_episode_steps: 27000
27003 [False] [0] [26998] [False]
I20230326 06:27:29.953630  2613 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26999; max_episode_steps=27000
I20230326 06:27:29.953689  2613 env.h:193] in Allocate: current_step: 1211 done: 0 max_episode_steps: 27000
27004 [False] [0] [26999] [False]
I20230326 06:27:29.963445  2613 atari_env.h:234] discount=0; lives=7; game_over=0; done=1; elapsed_step=27000; max_episode_steps=27000
I20230326 06:27:29.963516  2613 env.h:193] in Allocate: current_step: 1212 done: 1 max_episode_steps: 27000
27005 [ True] [0] [27000] [ True]
eval_episode=0, episodic_return=359780.0
vwxyzjn commented 1 year ago

@Trinkle23897, thanks for looking into this! I think terminated should not be true because the env is truncated.
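
For context, the distinction matters for learning: on termination there is nothing to bootstrap from, while on truncation the value of the final state should still be bootstrapped. A minimal, generic sketch of a TD target (not envpool or CleanRL code):

def td_target(reward, next_value, terminated, gamma=0.99):
    # Termination: the MDP really ended; nothing to bootstrap from.
    if terminated:
        return reward
    # Truncation or an ordinary step: keep bootstrapping from next_value.
    return reward + gamma * next_value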

Benjamin-eecs commented 1 year ago

https://github.com/openai/gym/issues/3102 all over again

Trinkle23897 commented 1 year ago

The above uses gym==0.23; using gym==0.26.2 gives:

3837 term=array([False]) trunc=array([False]) [3837] False
I20230327 05:49:26.849589 27255 atari_env.h:234] discount=1; lives=8; game_over=0; done=0; elapsed_step=3838; max_episode_steps=27000
I20230327 05:49:26.849642 27255 env.h:193] in Allocate: current_step: 3838 done: 0 max_episode_steps: 27000
3838 term=array([False]) trunc=array([False]) [3838] False
I20230327 05:49:26.858686 27255 atari_env.h:234] discount=0; lives=7; game_over=0; done=1; elapsed_step=3839; max_episode_steps=27000
I20230327 05:49:26.858741 27255 env.h:193] in Allocate: current_step: 3839 done: 1 max_episode_steps: 27000
3839 term=array([ True]) trunc=array([False]) [3839] True
eval_episode=0, episodic_return=51160.0

which is not the expected behavior.

including infos:

I20230327 05:52:50.942430 30022 atari_env.h:234] discount=1; lives=8; game_over=0; done=0; elapsed_step=3838; max_episode_steps=27000
I20230327 05:52:50.942497 30022 env.h:193] in Allocate: current_step: 3838 done: 0 max_episode_steps: 27000
3838 term=array([False]) trunc=array([False]) [3838] False {'env_id': array([0], dtype=int32), 'lives': array([8], dtype=int32), 'players': {'env_id': array([0], dtype=int32)}, 'reward': array([0.], dtype=float32), 'terminated': array([0], dtype=int32), 'elapsed_step': array([3838], dtype=int32)}
I20230327 05:52:50.951848 30022 atari_env.h:234] discount=0; lives=7; game_over=0; done=1; elapsed_step=3839; max_episode_steps=27000
I20230327 05:52:50.951895 30022 env.h:193] in Allocate: current_step: 3839 done: 1 max_episode_steps: 27000
3839 term=array([ True]) trunc=array([False]) [3839] True {'env_id': array([0], dtype=int32), 'lives': array([7], dtype=int32), 'players': {'env_id': array([0], dtype=int32)}, 'reward': array([0.], dtype=float32), 'terminated': array([0], dtype=int32), 'elapsed_step': array([3839], dtype=int32)}
eval_episode=0, episodic_return=51160.0

@vwxyzjn suggested returning term=True but info["terminated"]=False, which is the same as the behavior above.
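
For illustration, a rough sketch of how a loop could consume that convention with the gym>=0.26 API (assuming envs and actions as in the snippets above; names are illustrative, not a prescribed envpool usage):

import numpy as np

next_obs, reward, term, trunc, info = envs.step(actions)

# Under episodic_life=True, term may be True on a mere life loss while
# info["terminated"] stays 0; the real episode ends only on game over or
# when the time limit is hit.
life_done = np.logical_or(term, trunc)                # for value bootstrapping
real_done = np.logical_or(info["terminated"], trunc)  # for episode bookkeeping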

rfali commented 1 year ago

Just wanted to leave a working snippet here (although this is without a learned policy)

import envpool
import numpy as np
num_envs = 1
envs = envpool.make(
    "UpNDown-v5",
    env_type="gym",
    num_envs=num_envs,
    episodic_life=True,  # Espeholt et al., 2018, Tab. G.1
    repeat_action_probability=0,  # Hessel et al., 2022 (Muesli) Tab. 10
    noop_max=30,  # Espeholt et al., 2018, Tab. C.1 "Up to 30 no-ops at the beginning of each episode."
    full_action_space=False,  # Espeholt et al., 2018, Appendix G., "Following related work, experts use game-specific action sets."
    max_episode_steps=int(108000 / 4),  # Hessel et al. 2018 (Rainbow DQN), Table 3, Max frames per episode
    reward_clip=True,
    seed=1,
)
next_obs = envs.reset()
step = 0
done = False
episodic_return = []

for i in range(125000):
    step += 1

    next_obs, _, d, infos = envs.step(np.random.randint(0, envs.action_space.n, num_envs))
    episodic_return += infos["reward"][0]
    done = sum(infos["terminated"]) + sum(infos["TimeLimit.truncated"]) >= 1
    print('step', step, 'trunc', infos['TimeLimit.truncated'], 'term', infos["terminated"], infos["elapsed_step"], 'done', d)

    if step > int(108000 / 4) + 10:
        print("the environment is supposed to be done by now")
        break
print(f"eval_episode={len(episodic_return)}, episodic_return={episodic_return}")

Output

step 403 trunc [False] term [0] [399] done [False]
step 404 trunc [False] term [0] [400] done [False]
step 405 trunc [False] term [0] [401] done [False]
step 406 trunc [False] term [0] [402] done [False]
step 407 trunc [False] term [0] [403] done [False]
step 408 trunc [False] term [0] [404] done [False]
step 409 trunc [False] term [1] [405] done [ True]
eval_episode=0, episodic_return=[]