[bug] EOF error with gym.vector.AsyncVectorEnv() when calling the step method.

Describe the bug The code suddenly raises an EOFError when calling the step method after 12M steps of training.

Code example I am using gym.vector.AsyncVectorEnv(). I use the function make_env to create my environments.

def make_env(gym_id, seed, idx, capture_video, run_name, qubits, depth):

    def thunk():
        env = gym.make(gym_id, qubits=qubits, depth=depth, env_id=idx)
        env = gym.wrappers.RecordEpisodeStatistics(env)
        if capture_video and idx == 0:
            env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
        return env

    return thunk

The main part of the code is as follows:

if __name__ == "__main__":
    mp.set_start_method('spawn')
    device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")
    envs = gym.vector.AsyncVectorEnv(
        [make_env(args.gym_id, args.seed + i, i, args.capture_video, run_name, qubits, depth) for i in range(args.num_envs)],
        shared_memory=False,
    )
    agent = AgentGNN(envs, device).to(device)  # Graph Neural Network
    for update in range(1, num_updates + 1):
        for step in range(args.num_steps):  
            global_step += 1 * args.num_envs
            dones[step] = next_done
            try:
                with torch.no_grad():
                    action, logprob, _, value, logits, action_ids = agent.get_action_and_value(next_obs_graph, device=device)
                    values[step] = value.flatten()
                actions[step] = action
                logprobs[step] = logprob

                next_obs, reward, done, deprecated, info = envs.step(action_ids.cpu().numpy()) 
            except TypeError as e:
                print(f"Error: {e}")
            rewards[step] = torch.tensor(reward).to(device).view(-1)

            next_done = torch.Tensor(done).to(device)

As far as I understand the error, this code creates as many workers as there are environments (AsyncVectorEnv runs each environment in its own subprocess rather than in a thread). In one particular worker, the environment breaks in env.step(). As you can see, I tried to work around this with a try-except, but it does not help. I think this may be because the worker just hangs until it breaks, but I am not sure.
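To illustrate what the EOFError in the traceback below usually means at the pipe level, here is a minimal sketch using only the standard library. It does not use gym: `worker`, `run`, and `crash_at` are hypothetical names, and the crash is simulated with `os._exit`. It also uses the `fork` start method so it runs as a plain script on Linux, whereas the code above uses `spawn`. The point is that `recv()` on the parent side raises EOFError once the worker process has died and closed its end of the pipe, and that the worker's exit code can then be inspected.

```python
import multiprocessing as mp
import os


def worker(conn, crash_at):
    """Hypothetical stand-in for a vector-env worker: answer steps, then die."""
    step = 0
    while True:
        conn.recv()  # wait for a "step" command from the parent
        step += 1
        if step == crash_at:
            # Simulate a hard crash: the process exits without raising
            # a Python exception and without sending a reply.
            os._exit(3)
        conn.send(step)


def run(crash_at=3):
    """Step the worker until its pipe reports EOF; return replies and exit code."""
    ctx = mp.get_context("fork")  # assumption: Linux; the issue itself uses 'spawn'
    parent, child = ctx.Pipe()
    proc = ctx.Process(target=worker, args=(child, crash_at), daemon=True)
    proc.start()
    child.close()  # the parent must drop its copy, or a dead worker never yields EOF
    results = []
    try:
        while True:
            parent.send("step")
            results.append(parent.recv())
    except EOFError:
        # The worker closed its end of the pipe, i.e. the process is gone.
        proc.join(timeout=5)
        return results, proc.exitcode
    finally:
        parent.close()


if __name__ == "__main__":
    replies, exitcode = run(crash_at=3)
    print(f"replies before crash: {replies}, worker exit code: {exitcode}")
```

Two things follow from this sketch. First, an `except TypeError` clause would not catch the error here, since the exception raised is EOFError. Second, a negative exit code from `multiprocessing` means the worker was terminated by a signal (for example -9 for SIGKILL), which would point to something outside Python, such as the kernel's out-of-memory killer. In gym 0.26 the worker processes appear to be stored in `envs.processes`. Treat that attribute name as an assumption to verify against the installed version before relying on it.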

Traceback

Traceback (most recent call last):
  File "/home/jriu/Copt-cquere/rl-zx/ppo.py", line 204, in <module>
    next_obs, reward, done, deprecated, info = envs.step(action_ids.cpu().numpy())
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/vector_env.py", line 137, in step
    return self.step_wait()
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 320, in step_wait
    result, success = pipe.recv()
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py:457: UserWarning: WARN: Calling `close` while waiting for a pending call to `step` to complete.
Exception ignored in: <function AsyncVectorEnv.__del__ at 0x7ea18eb856c0>
Traceback (most recent call last):
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 546, in __del__
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/vector_env.py", line 205, in close
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 461, in close_extras
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 320, in step_wait
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 250, in recv
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
EOFError:

System Info I use gym 0.26.2, torch 2.0.1, and Python 3.10.14 on Ubuntu 24.04 LTS. All of the packages were installed using pip.

Checklist

[X] I have checked that there is no similar issue in the repo (required)

openai / gym

[bug] EOF error with gym.vector.AsyncVectorEnv() when calling the step method. #3281