openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev

[bug] EOF error with gym.vector.AsyncVectorEnv() when calling the step method. #3281

Open jng164 opened 2 months ago

jng164 commented 2 months ago

Describe the bug

The code suddenly raises an EOFError when calling the step method, after 12M steps of training.

Code example

I am using gym.vector.AsyncVectorEnv(). I use the function make_env to create my environments.

def make_env(gym_id, seed, idx, capture_video, run_name, qubits, depth):

    def thunk():
        env = gym.make(gym_id, qubits=qubits, depth=depth, env_id=idx)
        env = gym.wrappers.RecordEpisodeStatistics(env)
        if capture_video and idx == 0:
            env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
        return env

    return thunk

The main part of the code is as follows:

if __name__ == "__main__":
    mp.set_start_method('spawn')
    device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")
    envs = gym.vector.AsyncVectorEnv(
        [make_env(args.gym_id, args.seed + i, i, args.capture_video, run_name, qubits, depth) for i in range(args.num_envs)],
        shared_memory=False)
    agent = AgentGNN(envs, device).to(device)  # Graph Neural Network
    for update in range(1, num_updates + 1):
        for step in range(args.num_steps):  
            global_step += 1 * args.num_envs
            dones[step] = next_done
            try:
                with torch.no_grad():
                    action, logprob, _, value, logits, action_ids = agent.get_action_and_value(next_obs_graph, device=device)
                    values[step] = value.flatten()
                actions[step] = action
                logprobs[step] = logprob

                next_obs, reward, done, deprecated, info = envs.step(action_ids.cpu().numpy()) 
            except TypeError as e:
                print(f"Error: {e}")
            rewards[step] = torch.tensor(reward).to(device).view(-1)

            next_done = torch.Tensor(done).to(device)

As far as I understand the error, this code creates as many worker processes as environments. In one particular worker, the agent breaks in env.step(). As you can see, I tried to handle this with a try-except, but it does not work. I think this may be because the worker simply hangs until it breaks, but I am not sure.
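A note on why the try-except in the loop above may not help: it catches TypeError, while the exception in the traceback is EOFError, which multiprocessing raises from recv() when the other end of the pipe has been closed. The sketch below reproduces that behaviour with the standard library only. It does not use gym; the helper name recv_or_report and the fake ("obs", 1.0) message are made up for illustration, and closing the pipe end by hand only stands in for a worker that has gone away.

```python
import multiprocessing as mp


def recv_or_report(conn):
    """Receive one message; return (message, ok) instead of raising on a closed pipe."""
    try:
        return conn.recv(), True
    except EOFError:
        # The other end was closed and nothing is left to read.
        # An `except TypeError` clause would not catch this.
        return None, False


# A duplex pipe, similar in spirit to the parent/worker pipes of a vector env.
parent_end, worker_end = mp.Pipe()

# The "worker" answers one step, then disappears.
worker_end.send(("obs", 1.0))
first = recv_or_report(parent_end)

worker_end.close()  # stand-in for a worker process that died
second = recv_or_report(parent_end)

print(first)
print(second)
```

Catching EOFError in the training loop would only make the failure visible at that point; it would not bring the missing worker back.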

Traceback

Traceback (most recent call last):
  File "/home/jriu/Copt-cquere/rl-zx/ppo.py", line 204, in <module>
    next_obs, reward, done, deprecated, info = envs.step(action_ids.cpu().numpy())
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/vector_env.py", line 137, in step
    return self.step_wait()
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 320, in step_wait
    result, success = pipe.recv()
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py:457: UserWarning: WARN: Calling `close` while waiting for a pending call to `step` to complete.
Exception ignored in: <function AsyncVectorEnv.__del__ at 0x7ea18eb856c0>
Traceback (most recent call last):
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 546, in __del__
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/vector_env.py", line 205, in close
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 461, in close_extras
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 320, in step_wait
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 250, in recv
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
EOFError: 
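Since the parent fails inside pipe.recv(), one possible reading is that a worker process exited without sending a reply, for example because it was killed by a signal. This is an assumption, not something the traceback proves. A way to check would be to look at the exit code of each worker after the failure: multiprocessing.Process.exitcode is None while the process runs, 0 on a normal exit, and -N when the process was terminated by signal N. The helpers below (describe_exitcode and report_workers are hypothetical names) turn such codes into readable text; the example list of codes is invented.

```python
import signal


def describe_exitcode(code):
    """Turn a multiprocessing.Process.exitcode value into a readable string."""
    if code is None:
        return "still running"
    if code == 0:
        return "exited normally"
    if code < 0:
        # A negative exit code -N means the process was terminated by signal N.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = f"signal {-code}"
        return f"killed by {name}"
    return f"exited with status {code}"


def report_workers(exitcodes):
    """Map each worker index to a description of its exit code."""
    return {idx: describe_exitcode(code) for idx, code in enumerate(exitcodes)}


# Invented exit codes for four workers, for illustration only.
example = report_workers([None, 0, -9, 1])
for idx, text in example.items():
    print(idx, text)
```

With gym 0.26 the exit codes could presumably be collected inside an `except EOFError` block as `[p.exitcode for p in envs.processes]`, assuming the `processes` list kept by AsyncVectorEnv is accessible in the installed version.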

System Info

I use gym 0.26.2, torch 2.0.1, and Python 3.10.14 on Ubuntu 24.04 LTS. All of the packages were installed using pip.


Fengwenhao01 commented 1 month ago

I have the same problem.