openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev

Deepcopy env not working as expected #3254

Open wjessup opened 6 months ago

wjessup commented 6 months ago

Here's an attempt at replaying a step with a different action: if the step terminates, take a copy of the environment made before the step and try the other action.

import copy
import random
from itertools import count

import gym

env = gym.make("CartPole-v1")
state, info = env.reset()

for step in count():
    action = random.randint(0, 1)
    # Snapshot the environment before stepping so we can retry with the other action.
    prev_env = copy.deepcopy(env)
    next_state, reward, terminated, truncated, _ = env.step(action)
    print("state = ", prev_env.unwrapped.state)
    if terminated:
        other_action = 1 if action == 0 else 0
        next_state, reward, terminated, truncated, _ = prev_env.step(other_action)

Two problems: 1) calling prev_env.step() causes a WARN:

WARN: You are calling 'step()' even though this environment has already returned terminated = True. You should always call 'reset()' once you receive 'terminated = True' -- any further steps are undefined behavior.

2) print("state = ", prev_env.unwrapped.state) will print out the NEXT STATE, meaning its internal state changed even though .step() was called on env, not prev_env.

Any help is appreciated!

pseudo-rnd-thoughts commented 6 months ago

This is an interesting research question; the problem is that you have a bug and an incorrect assumption.

The bug is that after you take the alternative action, you need to replace your env with prev_env, since prev_env now holds the next state that you actually want.

The incorrect assumption is that if action X fails, then reverting and taking action Y won't fail. The problem is that in this example, for certain states, both action X and action Y will cause the environment to terminate.

My code

import copy
import random

import gym

env = gym.make("CartPole-v1")
state, info = env.reset()

for step in range(200):
    action = random.randint(0, 1)
    prev_env = copy.deepcopy(env)

    next_state, reward, terminated, truncated, _ = env.step(action)

    if terminated:
        other_action = 1 if action == 0 else 0

        print(f'terminated, taking alternative action ({other_action})')
        next_state, reward, terminated, truncated, _ = prev_env.step(other_action)
        if terminated:
            print('both actions cause termination, ending')
            break
        else:
            env = prev_env

Repeating this experiment 1000 times, I didn't find a single case where the alternative action didn't also cause the environment to terminate.
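
Roughly, the repetition can be wrapped like this (a sketch, not necessarily the exact script used):

import copy
import random

import gym

both_terminate = 0
recovered = 0

for episode in range(1000):
    env = gym.make("CartPole-v1")
    env.reset()
    for step in range(200):
        action = random.randint(0, 1)
        # Snapshot the pre-step environment so the alternative action can be tried.
        prev_env = copy.deepcopy(env)
        _, _, terminated, truncated, _ = env.step(action)
        if terminated:
            other_action = 1 - action
            _, _, also_terminated, _, _ = prev_env.step(other_action)
            if also_terminated:
                both_terminate += 1
            else:
                recovered += 1
            break
        if truncated:
            break

print(f"both actions terminated: {both_terminate}, alternative recovered: {recovered}")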

wjessup commented 6 months ago

Thanks for looking!

It does seem that terminated states cannot be recovered by replaying with a different final action. I wonder why not.

Anyway, you can rewind 2 actions and then it won't terminate :)

I'm considering storing these states and rewarding them differently. The state is on a boundary where taking one action leads to an unrecoverable termination...

The bug still stands:

I think you are running into the same bug. In my original code I do use prev_env.step() as you suggest. But if you inspect the state of prev_env, you'll notice it changes even though you don't call step on it! That was the bug.

So, instead I just store all the actions in a replay memory and make a new environment with the same seed, then replay those actions up to one step before the termination.

Try this and you'll find states that you can recover from:

import copy
import random
from collections import namedtuple
from itertools import count

import gym

# Transition and terminated_replay_memory are assumed to be defined elsewhere
# (the usual DQN-style replay-memory setup); minimal stand-ins:
Transition = namedtuple("Transition", ("state", "action", "next_state", "terminated"))
terminated_replay_memory = []

env = gym.make("CartPole-v1")


# Define the replay helper before the main loop so it exists when called.
def reset_and_replay(actions_taken, seed):
    if not actions_taken:
        # Rewound past the first action; nothing left to replay.
        return
    state, info = env.reset(seed=seed)
    last_action = actions_taken[-1]
    other_action = 1 if last_action == 0 else 0
    # Replay everything except the final (terminating) action to get back to
    # the state just before termination.
    for a in actions_taken[:-1]:
        env.step(a)

    next_state, reward, terminated, truncated, _ = env.step(other_action)
    state = next_state

    if terminated:
        print("unrecoverable!")
        reset_and_replay(actions_taken[:-1], seed)
    else:
        print("taking the other action didn't terminate!")
        if not truncated:
            terminated_replay_memory.append(Transition(state, last_action, next_state, terminated))


seed = random.randint(0, 10000)
state, info = env.reset(seed=seed)
actions_taken = []
for step in count():
    action = random.randint(0, 1)
    actions_taken.append(action)

    prev_state = state
    next_state, reward, terminated, truncated, _ = env.step(action)
    state = next_state

    if terminated or truncated:
        reset_and_replay(actions_taken, seed)
        break

pseudo-rnd-thoughts commented 6 months ago

I can't replicate the problem that you are talking about using either Gym or Gymnasium. What version of Gym are you using? It looks like v0.26 (which is what I'm using as well).

import copy

import gym
import gymnasium

print("Gym")
env = gym.make("CartPole-v1")
env.reset()

print(env.unwrapped.state)
copied_env = copy.deepcopy(env)
env.step(env.action_space.sample())
print(copied_env.unwrapped.state)

print("Gymnasium")
env = gymnasium.make("CartPole-v1")
env.reset()

print(env.unwrapped.state)
copied_env = copy.deepcopy(env)
env.step(env.action_space.sample())
print(copied_env.unwrapped.state)

wjessup commented 6 months ago

Here's how to see the issue:

import copy

import gymnasium as gym
env = gym.make("CartPole-v1")
env.reset()
copied_env = copy.deepcopy(env)

print(env.unwrapped.state)
print(copied_env.unwrapped.state)

next_state, reward, terminated, truncated, _ = env.step(1)

print(env.unwrapped.state)
print(copied_env.unwrapped.state)
print("next_state, reward, terminated, truncated, _ = ", next_state, reward, terminated, truncated, _)

next_state2, reward2, terminated2, truncated2, _ = copied_env.step(0)

print(env.unwrapped.state)
print(copied_env.unwrapped.state)
print("next_state2, reward2, terminated2, truncated2, _ = ", next_state2, reward2, terminated2, truncated2, _)
print()
print("why?", next_state == next_state2)

Output to look at:

why? [ True False True False]

Why are parts of the next_state equal to the next_state2 which came from a different action on a copied environment?

pseudo-rnd-thoughts commented 6 months ago

Why are parts of the next_state equal to the next_state2 which came from a different action on a copied environment?

Looking at the observation space (https://gymnasium.farama.org/environments/classic_control/cart_pole/#observation-space), all this means is that a single action hasn't caused the cart position or pole angle to change. It is not necessary for an action to affect every component of the observation.
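
A minimal sketch (assuming Gymnasium's CartPole-v1 observation layout of [cart position, cart velocity, pole angle, pole angular velocity]) that prints which components actually differ after one step with each action:

import copy

import gymnasium as gym

OBS_NAMES = ["cart position", "cart velocity", "pole angle", "pole angular velocity"]

env = gym.make("CartPole-v1")
env.reset(seed=0)
copied_env = copy.deepcopy(env)

obs_right, *_ = env.step(1)        # push right from the saved state
obs_left, *_ = copied_env.step(0)  # push left from the same saved state

for name, a, b in zip(OBS_NAMES, obs_right, obs_left):
    print(f"{name}: {'same' if a == b else 'different'} ({a:.5f} vs {b:.5f})")

Only the two velocity components change after a single step, which is exactly the [ True False True False] pattern above.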

But going back to your original post, this is not a bug and deepcopy of CartPole works as expected.

wjessup commented 6 months ago

This has been bugging me.

How can it be that if you go right, right, right, right, right, right and build up lots of velocity, your position is the same as if you go right, right, right, right, right, left?

This isn't a new issue: https://github.com/openai/gym/pull/1019

Per discussion with @joschu - both versions are correct. The old one is the vanilla Euler, and the suggested is semi-implicit Euler. While the new one is more stable, we'd rather not change the behavior of the existing environment. If you'd still like to use semi-implicit Euler, could you add a separate environment flag that turns it on (i.e. by default it is off, and a flag can turn it on)?
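
To make the quoted difference concrete, here is a minimal sketch of the two integrators for a generic 1D particle under acceleration a (not CartPole's actual dynamics):

def vanilla_euler(x, v, a, dt):
    # Position is updated with the OLD velocity, so this step's acceleration
    # (i.e. this step's action) cannot affect the new position.
    x = x + v * dt
    v = v + a * dt
    return x, v


def semi_implicit_euler(x, v, a, dt):
    # Velocity is updated first, so the new position already reflects the
    # acceleration applied this step.
    v = v + a * dt
    x = x + v * dt
    return x, v


# One step from the same state with opposite accelerations:
print(vanilla_euler(0.0, 1.0, +10.0, 0.02), vanilla_euler(0.0, 1.0, -10.0, 0.02))
# positions identical, only velocities differ
print(semi_implicit_euler(0.0, 1.0, +10.0, 0.02), semi_implicit_euler(0.0, 1.0, -10.0, 0.02))
# positions differ as well

This is why, under the current vanilla Euler integrator, a single step with a different action leaves the position and angle unchanged and only the velocities differ.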

I'm not sure why the team decided back then not to change the behavior; I'd recommend doing so.

If changing the behavior to semi-implicit Euler by default isn't possible, then at least document that it's the preferred method and that it matches how game engines and reality work.