Closed vwxyzjn closed 1 year ago
I will look into it
Running into this:
26996 [False] [0] [26991] [False]
I20230326 05:36:29.981667 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26992; max_episode_steps=27000
26997 [False] [0] [26992] [False]
I20230326 05:36:29.990259 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26993; max_episode_steps=27000
26998 [False] [0] [26993] [False]
I20230326 05:36:29.999145 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26994; max_episode_steps=27000
26999 [False] [0] [26994] [False]
I20230326 05:36:30.008062 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26995; max_episode_steps=27000
27000 [False] [0] [26995] [False]
I20230326 05:36:30.016428 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26996; max_episode_steps=27000
27001 [False] [0] [26996] [False]
I20230326 05:36:30.024385 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26997; max_episode_steps=27000
27002 [False] [0] [26997] [False]
I20230326 05:36:30.033180 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26998; max_episode_steps=27000
27003 [False] [0] [26998] [False]
I20230326 05:36:30.042129 47267 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26999; max_episode_steps=27000
27004 [False] [0] [26999] [False]
I20230326 05:36:30.052059 47267 atari_env.h:234] discount=0; lives=7; game_over=0; done=1; elapsed_step=27000; max_episode_steps=27000
27005 [False] [0] [27000] [ True]
I20230326 05:36:30.091991 47267 atari_env.h:234] discount=1; lives=5; game_over=0; done=0; elapsed_step=0; max_episode_steps=27000
27006 [False] [0] [0] [False]
I20230326 05:36:30.101820 47267 atari_env.h:234] discount=1; lives=5; game_over=0; done=0; elapsed_step=1; max_episode_steps=27000
27007 [False] [0] [1] [False]
I20230326 05:36:30.110059 47267 atari_env.h:234] discount=1; lives=5; game_over=0; done=0; elapsed_step=2; max_episode_steps=27000
27008 [False] [0] [2] [False]
I20230326 05:36:30.117316 47267 atari_env.h:234] discount=1; lives=5; game_over=0; done=0; elapsed_step=3; max_episode_steps=27000
27009 [False] [0] [3] [False]
I20230326 05:36:30.125382 47267 atari_env.h:234] discount=1; lives=5; game_over=0; done=0; elapsed_step=4; max_episode_steps=27000
27010 [False] [0] [4] [False]
I20230326 05:36:30.133920 47267 atari_env.h:234] discount=1; lives=5; game_over=0; done=0; elapsed_step=5; max_episode_steps=27000
27011 [False] [0] [5] [False]
the environment is supposed to be done by now
eval_episode=0, episodic_return=359780.0
I20230326 06:10:13.587806 71402 atari_env.h:234] discount=1; lives=8; game_over=0; done=0; elapsed_step=3838; max_episode_steps=27000
I20230326 06:10:13.587837 71402 env.h:193] in Allocate: current_step: 3838 done: 0 max_episode_steps: 27000
3838 [False] [0] [3838] [False]
I20230326 06:10:13.594794 71402 atari_env.h:234] discount=0; lives=7; game_over=0; done=1; elapsed_step=3839; max_episode_steps=27000
I20230326 06:10:13.594839 71402 env.h:193] in Allocate: current_step: 3839 done: 1 max_episode_steps: 27000
3839 [False] [0] [3839] [ True]
I20230326 06:10:13.608574 71402 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=3839; max_episode_steps=27000
I20230326 06:10:13.608611 71402 env.h:193] in Allocate: current_step: 0 done: 0 max_episode_steps: 27000
3840 [False] [0] [3839] [False]
I20230326 06:10:13.616739 71402 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=3840; max_episode_steps=27000
I20230326 06:10:13.616784 71402 env.h:193] in Allocate: current_step: 1 done: 0 max_episode_steps: 27000
3841 [False] [0] [3840] [False]
I20230326 06:10:13.624781 71402 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=3841; max_episode_steps=27000
I20230326 06:10:13.624830 71402 env.h:193] in Allocate: current_step: 2 done: 0 max_episode_steps: 27000
3842 [False] [0] [3841] [False]
I know the issue: with episodic_life set, the step counter in env.h (the base class) doesn't track the real episode step; it tracks the per-life episode step.
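To illustrate the failure mode, here is a hypothetical minimal sketch (not envpool's actual code): if the base class resets its step counter on every per-life done, the max_episode_steps check never sees the real episode length when lives are lost often enough.

```python
# Hypothetical sketch of the suspected counter bug; this is not envpool's
# actual code, just an illustration of the failure mode.
MAX_EPISODE_STEPS = 10

class BuggyCounter:
    """Resets the step counter on every done, including mere life losses,
    so truncation never fires if a life is lost often enough."""
    def __init__(self):
        self.current_step = 0

    def step(self, life_lost):
        self.current_step += 1
        truncated = self.current_step >= MAX_EPISODE_STEPS
        done = life_lost or truncated
        if done:  # bug: a mere life loss also resets the counter
            self.current_step = 0
        return done, truncated

class FixedCounter:
    """Tracks the real episode step; only a true episode end resets it."""
    def __init__(self):
        self.elapsed_step = 0

    def step(self, life_lost, game_over=False):
        self.elapsed_step += 1
        truncated = self.elapsed_step >= MAX_EPISODE_STEPS
        done = life_lost or game_over or truncated
        if game_over or truncated:  # reset only on a real episode end
            self.elapsed_step = 0
        return done, truncated
```

With a life lost every 5 steps, BuggyCounter never reaches the 10-step limit, while FixedCounter truncates at the real step 10.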
After the fix:
3837 [False] [0] [3837] [False]
I20230326 06:20:50.503247 81434 atari_env.h:234] discount=1; lives=8; game_over=0; done=0; elapsed_step=3838; max_episode_steps=27000
I20230326 06:20:50.503301 81434 env.h:193] in Allocate: current_step: 3838 done: 0 max_episode_steps: 27000
3838 [False] [0] [3838] [False]
I20230326 06:20:50.512998 81434 atari_env.h:234] discount=0; lives=7; game_over=0; done=1; elapsed_step=3839; max_episode_steps=27000
I20230326 06:20:50.513052 81434 env.h:193] in Allocate: current_step: 3839 done: 1 max_episode_steps: 27000
3839 [False] [0] [3839] [ True]
I20230326 06:20:50.530524 81434 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=3839; max_episode_steps=27000
I20230326 06:20:50.530570 81434 env.h:193] in Allocate: current_step: 0 done: 0 max_episode_steps: 27000
3840 [False] [0] [3839] [False]
This still holds: it loses one life (8 to 7), but the step is neither terminated nor truncated.
@vwxyzjn should I set terminated=True in the case above?
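For background on why a life loss reports done without terminated or truncated: under the common episodic-life convention (as in the classic EpisodicLifeEnv wrapper used by many Atari baselines), losing a life ends the per-life training episode while the real episode continues until game over or the time limit. A minimal sketch of that convention (illustrative names, not envpool's API):

```python
# Sketch of the episodic-life convention: a life loss ends the per-life
# (training) episode, but the real episode only ends on game over or on
# hitting the time limit. Names here are illustrative.
def life_episode_done(lives_before, lives_after, game_over, truncated):
    life_lost = lives_after < lives_before
    done = life_lost or game_over or truncated  # per-life done signal
    real_done = game_over or truncated          # real episode boundary
    return done, real_done
```

In the log above, lives drop from 8 to 7, so done would be True while the real episode boundary is not reached, which is why neither terminated nor truncated is set.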
27002 [False] [0] [26997] [False]
I20230326 06:27:29.944105 2613 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26998; max_episode_steps=27000
I20230326 06:27:29.944160 2613 env.h:193] in Allocate: current_step: 1210 done: 0 max_episode_steps: 27000
27003 [False] [0] [26998] [False]
I20230326 06:27:29.953630 2613 atari_env.h:234] discount=1; lives=7; game_over=0; done=0; elapsed_step=26999; max_episode_steps=27000
I20230326 06:27:29.953689 2613 env.h:193] in Allocate: current_step: 1211 done: 0 max_episode_steps: 27000
27004 [False] [0] [26999] [False]
I20230326 06:27:29.963445 2613 atari_env.h:234] discount=0; lives=7; game_over=0; done=1; elapsed_step=27000; max_episode_steps=27000
I20230326 06:27:29.963516 2613 env.h:193] in Allocate: current_step: 1212 done: 1 max_episode_steps: 27000
27005 [ True] [0] [27000] [ True]
eval_episode=0, episodic_return=359780.0
@Trinkle23897, thanks for looking into this! I think terminated should not be true because the env is truncated.
https://github.com/openai/gym/issues/3102 all over again
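The distinction that gym issue is about: hitting max_episode_steps is a truncation, not a termination, and the two need different value targets. A generic sketch of why it matters (not envpool code):

```python
# Why terminated vs truncated matters for learning (generic TD-target
# sketch, not envpool code): a truncated episode would have continued,
# so the target should still bootstrap; a terminated one should not.
GAMMA = 0.99

def td_target(reward, next_value, terminated):
    if terminated:
        return reward  # true environment end: no future return
    # ongoing or merely truncated: bootstrap from the next state's value
    return reward + GAMMA * next_value
```

Mislabeling a time-limit cutoff as terminated zeros out the bootstrap term and biases the value estimates.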
The above uses gym==0.23; using gym==0.26.2 gives:
3837 term=array([False]) trunc=array([False]) [3837] False
I20230327 05:49:26.849589 27255 atari_env.h:234] discount=1; lives=8; game_over=0; done=0; elapsed_step=3838; max_episode_steps=27000
I20230327 05:49:26.849642 27255 env.h:193] in Allocate: current_step: 3838 done: 0 max_episode_steps: 27000
3838 term=array([False]) trunc=array([False]) [3838] False
I20230327 05:49:26.858686 27255 atari_env.h:234] discount=0; lives=7; game_over=0; done=1; elapsed_step=3839; max_episode_steps=27000
I20230327 05:49:26.858741 27255 env.h:193] in Allocate: current_step: 3839 done: 1 max_episode_steps: 27000
3839 term=array([ True]) trunc=array([False]) [3839] True
eval_episode=0, episodic_return=51160.0
which is not the expected behavior.
Including infos:
I20230327 05:52:50.942430 30022 atari_env.h:234] discount=1; lives=8; game_over=0; done=0; elapsed_step=3838; max_episode_steps=27000
I20230327 05:52:50.942497 30022 env.h:193] in Allocate: current_step: 3838 done: 0 max_episode_steps: 27000
3838 term=array([False]) trunc=array([False]) [3838] False {'env_id': array([0], dtype=int32), 'lives': array([8], dtype=int32), 'players': {'env_id': array([0], dtype=int32)}, 'reward': array([0.], dtype=float32), 'terminated': array([0], dtype=int32), 'elapsed_step': array([3838], dtype=int32)}
I20230327 05:52:50.951848 30022 atari_env.h:234] discount=0; lives=7; game_over=0; done=1; elapsed_step=3839; max_episode_steps=27000
I20230327 05:52:50.951895 30022 env.h:193] in Allocate: current_step: 3839 done: 1 max_episode_steps: 27000
3839 term=array([ True]) trunc=array([False]) [3839] True {'env_id': array([0], dtype=int32), 'lives': array([7], dtype=int32), 'players': {'env_id': array([0], dtype=int32)}, 'reward': array([0.], dtype=float32), 'terminated': array([0], dtype=int32), 'elapsed_step': array([3839], dtype=int32)}
eval_episode=0, episodic_return=51160.0
@vwxyzjn suggested returning term=True but info["terminated"] = False, which is the same as the behavior above.
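Under that suggestion, a consumer of the old combined-done API could recover the real flags from info. A sketch (key names follow the logs in this thread; this is not envpool's implementation):

```python
# Sketch of recovering (terminated, truncated) from a combined done flag
# plus info["terminated"], per the suggestion above. Key names follow the
# logs in this thread; this is not envpool's implementation.
import numpy as np

def split_done(done, info):
    terminated = info["terminated"].astype(bool)
    truncated = done & ~terminated  # done but not terminated => truncated
    return terminated, truncated
```

This keeps the old done semantics intact while letting the two episode-boundary causes be told apart.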
Just wanted to leave a working snippet here (although this is without a learned policy):
import envpool
import numpy as np

num_envs = 1
envs = envpool.make(
    "UpNDown-v5",
    env_type="gym",
    num_envs=num_envs,
    episodic_life=True,  # Espeholt et al., 2018, Tab. G.1
    repeat_action_probability=0,  # Hessel et al., 2022 (Muesli) Tab. 10
    noop_max=30,  # Espeholt et al., 2018, Tab. C.1 "Up to 30 no-ops at the beginning of each episode."
    full_action_space=False,  # Espeholt et al., 2018, Appendix G., "Following related work, experts use game-specific action sets."
    max_episode_steps=int(108000 / 4),  # Hessel et al. 2018 (Rainbow DQN), Table 3, Max frames per episode
    reward_clip=True,
    seed=1,
)
next_obs = envs.reset()
step = 0
done = False
episodic_return = []
running_return = 0.0  # reward accumulated within the current episode
for i in range(125000):
    step += 1
    next_obs, _, d, infos = envs.step(np.random.randint(0, envs.action_space.n, num_envs))
    running_return += infos["reward"][0]
    done = sum(infos["terminated"]) + sum(infos["TimeLimit.truncated"]) >= 1
    if done:
        episodic_return.append(running_return)
        running_return = 0.0
    print('step', step, 'trunc', infos['TimeLimit.truncated'], 'term', infos["terminated"], infos["elapsed_step"], 'done', d)
    if step > int(108000 / 4) + 10:
        print("the environment is supposed to be done by now")
        break
print(f"eval_episode={len(episodic_return)}, episodic_return={episodic_return}")
Output
step 403 trunc [False] term [0] [399] done [False]
step 404 trunc [False] term [0] [400] done [False]
step 405 trunc [False] term [0] [401] done [False]
step 406 trunc [False] term [0] [402] done [False]
step 407 trunc [False] term [0] [403] done [False]
step 408 trunc [False] term [0] [404] done [False]
step 409 trunc [False] term [1] [405] done [ True]
eval_episode=0, episodic_return=[]
Describe the bug
It is possible for envpool to return done=True, but infos['TimeLimit.truncated'] and infos["terminated"] both return False.

To Reproduce
See the following snippet and download the model. The code is roughly
As shown, done=True, but it reports that the environment is neither terminated nor truncated. This is a bug because the environment is in fact truncated.

Expected behavior
infos['TimeLimit.truncated'] should be True at the 27000th step.
Additional context
Might be related to #179