openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev

FreewayDeterministic-v4 is not deterministic #1478

Closed tongzhoumu closed 5 years ago

tongzhoumu commented 5 years ago

I thought "Deterministic-v4" was supposed to be a deterministic version of any Atari game. However, I found that FreewayDeterministic-v4 is not deterministic: I execute a fixed action sequence several times, but I do not get the exact same observation each time. The following code reproduces it. My gym version is 0.12.1, and my Python version is 3.5.2.

import gym
import copy
import numpy as np

actions = [1, 2, 0, 0, 1, 0, 1, 0, 0, 0, 1, 2, 2, 1, 1, 2, 1, 0, 2, 1, 1, 0, 1, 2, 2, 1, 1, 0, 2, 1, 0, 0, 1, 0, 2, 2, 2, 2, 2, 0, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 0, 2, 0, 2, 2, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0]

final_ob_in_last_game = None
count = 0
while True:
    count += 1
    print('Game', count)
    env = gym.make('FreewayDeterministic-v4')
    env.reset()
    # Replay the same fixed action sequence every game.
    for action in actions:
        ob, reward, done, _ = env.step(action)
    # If the final observation differs from the previous game's, print the
    # indices where the two frames disagree.
    if final_ob_in_last_game is not None and not ((final_ob_in_last_game == ob).all()):
        print(np.nonzero(final_ob_in_last_game - ob))
    final_ob_in_last_game = copy.deepcopy(ob)

Follow-up: I just found that "FrostbiteDeterministic-v4" is also not deterministic.

import gym
import copy
import numpy as np

game = 'Frostbite'
actions = [17, 4, 3, 6, 8, 9, 3, 5, 16, 4, 13, 12, 15, 6, 9, 11, 14, 2, 15, 8, 8, 8, 2, 17, 10, 14, 1, 11, 4, 11, 7, 11, 16, 9, 0, 8, 7, 9, 5, 9, 4, 5, 13, 11, 14, 15, 6, 8, 15, 6, 0, 17, 11, 10, 4, 5, 7, 14, 0, 10, 16, 5, 13, 0, 15, 4, 15, 8, 9, 11, 12, 16, 2, 7, 4, 17, 0, 2, 5, 6, 14, 3, 17, 13, 1, 0, 5, 11, 10, 16, 12, 5, 12, 11, 3, 2, 6, 9, 4, 8, 14, 8, 14, 7, 15, 10, 13, 6, 9, 11, 10, 0, 16, 6, 3, 16, 3, 0, 1, 4, 12, 6, 1, 0, 6, 15, 14, 11, 12, 0, 12, 4, 6, 15, 8, 16, 16, 4, 10, 10, 10, 7, 8, 11, 10, 7, 5, 14, 10, 6, 10, 12, 11, 4, 1, 10, 9, 16, 10, 4, 6, 3, 0, 10, 11, 7, 5, 17, 0, 2, 15, 6, 12, 6, 16, 14, 11, 16, 1, 16, 8, 8, 16, 1, 6, 9, 2, 3, 0, 14, 4, 16, 12, 1, 0, 11, 17, 9, 15, 9, 0, 1, 13, 8, 4, 0, 0, 10, 4, 15, 10, 5, 1, 2, 1, 7, 4, 17, 0, 16, 9, 0, 3, 9, 17, 11, 8, 14, 17, 11, 17, 8, 6, 12, 9, 14, 7, 14, 3, 7, 14, 15, 15, 5, 16, 13, 0, 0, 10, 17, 11, 15, 10, 6, 13, 13, 4, 6, 3, 17, 3, 1, 12, 8, 2, 10, 17, 11, 3, 9, 3, 4, 1, 13, 4, 1, 14, 16, 2, 4, 2, 2, 11, 3, 9, 2, 17, 0, 1, 8, 11, 6, 5, 11, 6, 0, 12, 3, 5, 9, 14, 10, 17, 14, 2, 0, 9, 2, 13, 9, 14, 6, 8, 9, 1, 14, 3, 8, 2, 1, 12, 2, 3, 4, 0, 3, 7, 16, 5, 16, 3, 2, 1, 1, 4, 8, 4, 7, 14, 9, 15, 1, 17, 11, 0, 11, 16, 3, 15, 1, 10, 6, 15, 7, 16, 7, 11, 10, 2, 5, 16, 6, 13, 14, 15, 13, 14, 7, 1, 9, 2, 4, 0, 16, 11, 16, 14, 17, 17, 9, 1, 15, 4, 16, 16, 6, 9, 8, 6, 11, 11, 7, 1, 16, 9, 15, 6, 12, 14, 2, 3, 1, 11, 5, 13, 12, 15, 1, 2, 8, 5, 1, 1, 1, 9]

final_ob_in_last_game = None
count = 0
while True:
    count += 1
    print('Game', count)
    env = gym.make(game+'Deterministic-v4')
    env.reset()
    for action in actions:
        ob, reward, done, _ = env.step(action)
    print(done)
    if final_ob_in_last_game is not None and not ((final_ob_in_last_game == ob).all()):
        print(np.nonzero(final_ob_in_last_game - ob))
    final_ob_in_last_game = copy.deepcopy(ob)
christopherhesse commented 5 years ago

Do you get the same issue if you use 'FreewayNoFrameskip-v4'?

tongzhoumu commented 5 years ago

> Do you get the same issue if you use 'FreewayNoFrameskip-v4'?

Yes, if I manually skip 4 frames, it also happens.

christopherhesse commented 5 years ago

Hmm, why does it only happen if you manually skip 4 frames?

tongzhoumu commented 5 years ago

> Hmm, why does it only happen if you manually skip 4 frames?

No, it happens with both Deterministic-v4 AND NoFrameskip-v4.

For Deterministic-v4, try this:

import gym
import copy
import numpy as np

actions = [1, 2, 0, 0, 1, 0, 1, 0, 0, 0, 1, 2, 2, 1, 1, 2, 1, 0, 2, 1, 1, 0, 1, 2, 2, 1, 1, 0, 2, 1, 0, 0, 1, 0, 2, 2, 2, 2, 2, 0, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 0, 2, 0, 2, 2, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0]

final_ob_in_last_game = None
count = 0
while True:
    count += 1
    print('Game', count)
    env = gym.make('FreewayDeterministic-v4')
    env.reset()
    for action in actions:
        ob, reward, done, _ = env.step(action)
    if final_ob_in_last_game is not None and not ((final_ob_in_last_game == ob).all()):
        print(np.nonzero(final_ob_in_last_game - ob))
    final_ob_in_last_game = copy.deepcopy(ob)

For NoFrameskip-v4, try this:

import gym
import copy
import numpy as np

game = 'Freeway'
actions = [1, 2, 0, 0, 1, 0, 1, 0, 0, 0, 1, 2, 2, 1, 1, 2, 1, 0, 2, 1, 1, 0, 1, 2, 2, 1, 1, 0, 2, 1, 0, 0, 1, 0, 2, 2, 2, 2, 2, 0, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 0, 2, 0, 2, 2, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0]

final_ob_in_last_game = None
count = 0
while True:
    count += 1
    print('Game', count)
    env = gym.make(game+'NoFrameskip-v4')
    env.reset()
    for action in actions:
        # Manually repeat each action for 4 frames to mimic the built-in frameskip.
        for _ in range(4):
            ob, reward, done, _ = env.step(action)
    if final_ob_in_last_game is not None and not ((final_ob_in_last_game == ob).all()):
        # print(np.nonzero(final_ob_in_last_game - ob))
        print('Different observation!')
    final_ob_in_last_game = copy.deepcopy(ob)
christopherhesse commented 5 years ago

Ah, I didn't realize it was so sensitive to the exact action sequence.

It looks like you're not seeding the environment, so it uses a random seed. Here's a version that calls seed():

import gym

game = 'Freeway'
actions = [1, 2, 0, 0, 1, 0, 1, 0, 0, 0, 1, 2, 2, 1, 1, 2, 1, 0, 2, 1, 1, 0, 1, 2, 2, 1, 1, 0, 2, 1, 0, 0, 1, 0, 2, 2, 2, 2, 2, 0, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 0, 2, 0, 2, 2, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0]

final_ob_in_last_game = None
count = 0
while True:
    count += 1
    print('Game', count)
    env = gym.make(game+'NoFrameskip-v4')
    env.seed(0)
    env.reset()
    for action in actions:
        for _ in range(4):
            ob, _, _, _ = env.step(action)
    if final_ob_in_last_game is not None and not ((final_ob_in_last_game == ob).all()):
        print('Different observation!')
    final_ob_in_last_game = ob.copy()
    env.close()

Does it still happen if you seed the environment?

tongzhoumu commented 5 years ago

> Ah, I didn't realize it was so sensitive to the exact action sequence.
>
> It looks like you're not seeding the environment, so it uses a random seed. Here's a version that calls seed():
>
> import gym
>
> game = 'Freeway'
> actions = [1, 2, 0, 0, 1, 0, 1, 0, 0, 0, 1, 2, 2, 1, 1, 2, 1, 0, 2, 1, 1, 0, 1, 2, 2, 1, 1, 0, 2, 1, 0, 0, 1, 0, 2, 2, 2, 2, 2, 0, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 0, 2, 0, 2, 2, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0]
>
> final_ob_in_last_game = None
> count = 0
> while True:
>     count += 1
>     print('Game', count)
>     env = gym.make(game+'NoFrameskip-v4')
>     env.seed(0)
>     env.reset()
>     for action in actions:
>         for _ in range(4):
>             ob, _, _, _ = env.step(action)
>     if final_ob_in_last_game is not None and not ((final_ob_in_last_game == ob).all()):
>         print('Different observation!')
>     final_ob_in_last_game = ob.copy()
>     env.close()
>
> Does it still happen if you seed the environment?

Hi, your code works. However, I find that I need to seed the environment each time before reset(); otherwise it is still not entirely deterministic. For example, if I only seed the env once:

import gym

game = 'Freeway'
actions = [1, 2, 0, 0, 1, 0, 1, 0, 0, 0, 1, 2, 2, 1, 1, 2, 1, 0, 2, 1, 1, 0, 1, 2, 2, 1, 1, 0, 2, 1, 0, 0, 1, 0, 2, 2, 2, 2, 2, 0, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 0, 2, 0, 2, 2, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0]

env = gym.make(game+'NoFrameskip-v4')
env.seed(0)  # seeded only once, before the loop
final_ob_in_last_game = None
count = 0
while True:
    count += 1
    print('Game', count)
    env.reset()
    for action in actions:
        for _ in range(4):
            ob, _, _, _ = env.step(action)
    if final_ob_in_last_game is not None and not ((final_ob_in_last_game == ob).all()):
        print('Different observation!')
    final_ob_in_last_game = ob.copy()
    env.close()

Does that mean env.reset() doesn't fully reset the environment?

christopherhesse commented 5 years ago

In general, no: reset() doesn't reset the RNG state, so you have to call env.seed(0) before each reset, as you pointed out. This is standard behavior for environments with random initial conditions, which apparently includes ALE games in gym.
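To make that concrete, here is a minimal sketch of the pattern (the helper run_seeded_episode and the short action list are mine, assuming the old-style gym API used in this thread where env.seed() exists): call env.seed(...) immediately before every env.reset() so each episode starts from the same RNG state.

import gym

def run_seeded_episode(env, actions, seed=0):
    # Re-seeding here is the key step: reset() alone does not restore the RNG state.
    env.seed(seed)
    env.reset()
    ob = None
    for action in actions:
        ob, _, done, _ = env.step(action)
        if done:
            break
    return ob

env = gym.make('FreewayDeterministic-v4')
actions = [1, 2, 0] * 20  # any fixed action sequence
first = run_seeded_episode(env, actions)
second = run_seeded_episode(env, actions)
# With the re-seed before each reset, the replay should produce identical final observations.
assert (first == second).all()
env.close()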

It's confusing that the "Deterministic" version of an environment is not actually deterministic.

tongzhoumu commented 5 years ago

Got it. Thank you!