ConnectionResetError: [Errno 104] Connection reset by peer

rl-2 commented 2 years ago

Hello,

I'm trying to train a PPO agent with Stable Baselines, following the instructions in Sec. 5.2.2. After running ./TrainAndTestOpenAIStableBaselines.sh within_template, I got the following error:

Traceback (most recent call last):
  File "OpenAI_StableBaseline_Train.py", line 231, in <module>
    range(c.num_worker)])
  File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 111, in __init__
    observation_space, action_space = self.remotes[0].recv()
  File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

I wonder if I missed a step to activate the Science Birds application? Please let me know.

Thank you!

Cheng-Xue commented 2 years ago

Hi Rodger, please try the new version and let me know if the issue persists. Thanks.

rl-2 commented 2 years ago

Hi Cheng, it seems the issue is still there. Here is a full log:

Error in client-server communication: [Errno 111] Connection refused
Process ForkServerProcess-20:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 24, in _worker
    env = env_fn_wrapper.var()
  File "/home/ubuntu/RL-AngryBirds/sciencebirdsagents/Utils/utils.py", line 64, in _init
    max_attempts_per_level=max_attempts_per_level)
  File "/home/ubuntu/RL-AngryBirds/sciencebirdsagents/SBEnvironment/SBEnvironmentWrapperOpenAI.py", line 78, in __init__
    self.connect_agent_to_server()
  File "/home/ubuntu/RL-AngryBirds/sciencebirdsagents/SBEnvironment/SBEnvironmentWrapperOpenAI.py", line 88, in connect_agent_to_server
    self.ar.configure(self.env_id)
  File "/home/ubuntu/RL-AngryBirds/sciencebirdsagents/Client/agent_client.py", line 171, in configure
    self.playing_mode.value
  File "/home/ubuntu/RL-AngryBirds/sciencebirdsagents/Client/agent_client.py", line 131, in _send_command
    self.server_socket.sendall(msg)
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "OpenAI_StableBaseline_Train.py", line 231, in <module>
    range(c.num_worker)])
  File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 111, in __init__
    observation_space, action_space = self.remotes[0].recv()
  File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/ubuntu/anaconda3/envs/pytorch_p37/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

rl-2 commented 2 years ago

To follow up on this issue, I started the game server before running the script and got a similar error:

021-11-30 00:57:35,012 - OpenAI stable baselines Training and Testing - INFO - training step: 0
Server started...
Error in client-server communication: [Errno 111] Connection refused

On the server side, it seems it has been killed automatically:

The Science Birds Server is waiting for the first agent to connect
Waiting for agent  
Killed

Cheng-Xue commented 2 years ago

Hi Rodger, the problem is most likely still that the game server was not successfully initialised. Can you provide the exact environment you are using so that we can replicate the issue? Thanks.

rl-2 commented 2 years ago

Thanks, Cheng. Below is the environment info:

Ubuntu: 18.04.6 LTS
Python: 3.7.10
Numpy: 1.18.5
Torch: 1.10.0
Torchvision: 0.8.2
lxml: 4.6.3
tensorboard: 2.7.0
Java: 13.0.4
stable-baselines: 1.3.0

And the steps I've taken are:

Run java -jar ./game_playing_interface.jar and the terminal shows:

The Science Birds Server is waiting for the first agent to connect
Waiting for agent

Run ./TrainAndTestOpenAIStableBaselines.sh within_template. Then I got the errors shown in this thread.

Cheng-Xue commented 2 years ago

Hi Luo, I have pushed an updated version. The new version opens a new terminal window to run the server. Please let me know if the problem still exists. Cheers.

rl-2 commented 2 years ago

Hi Cheng,

Thanks a ton for the update! I saw this error when I ran the code:

sh: 1: gnome-terminal: not found

Note that I'm running the code on an AWS instance. Could that be what prevents it from launching a new terminal window?
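
The "gnome-terminal: not found" message means the shell could not find that executable on PATH, which is common on headless cloud instances. As a minimal sketch (the function name has_terminal_emulator is hypothetical and not part of the benchmark code), the check can be reproduced from Python before deciding how to start the server:

```python
import shutil


def has_terminal_emulator(name="gnome-terminal", which=shutil.which):
    """Return True if an executable called `name` is found on PATH.

    `which` is injectable only so the check can be exercised without
    depending on what is installed on the current machine.
    """
    return which(name) is not None


if __name__ == "__main__":
    if has_terminal_emulator():
        print("gnome-terminal found: a new terminal window can be opened")
    else:
        print("gnome-terminal not found: run the server in the background instead")
```

If the check fails, running the server as a background process rather than in a new window is the likely fallback.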

Cheng-Xue commented 2 years ago

Hi Rodger, running on AWS is a bit tricky. We ran our tests on AWS as well, but it only supports 'symbolic' mode at the moment. You can activate the initial version by setting self.headless_server = True at line 10 in Server.py.

Can you please verify whether the following command successfully starts the server?

bash -c "cd ../sciencebirdsgames/Linux && nohup java -jar ./game_playing_interface.jar --headless --dev > out 2>&1 &"
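
To tell whether the server is actually listening after a command like the one above, one option is to poll its TCP port. This is only a diagnostic sketch: wait_for_port is a hypothetical helper, and the host and port are assumptions that must be taken from your own server configuration, since this thread does not state them. Note also that the server may treat a probe connection as an agent connecting, so use it for debugging rather than inside the training script.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True as soon as a connection is accepted, or False if
    `timeout` seconds pass without one.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection as a liveness probe.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers ConnectionRefusedError (Errno 111): nothing is listening yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A False result together with a "Killed" message on the server side suggests the server process died before it could accept connections, so the contents of the out file are the next thing to inspect.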

hawe66 commented 7 months ago

I also have a question regarding server.py.
You used 3 conditions: self.if_head, self.headless_server, self.state_repr_type.

The --dev > out 2>&1 option is added in lines 22, 33, 43, 52 (when self.headless_server==True).
Doesn't this option correspond to self.state_repr_type?
The --headless option is added in lines 22, 27, 43, 47 (when self.if_head==False and self.state_repr_type=='symbolic', or when self.if_head=='headless').
This looks like incorrect code, since you didn't add a self.state_repr_type condition later on (i.e. in the elif and else branches).
Also, I don't get why you added the similarly functioning conditions self.if_head and self.headless_server.
Can you explain this to me?

Cheng-Xue commented 6 months ago

Hi Hawe,

Apologies for the delay in getting back to you.

Regarding your questions:

The addition of --dev > out 2>&1 corresponds to the use of symbolic states. When the image representation is used instead, the agent does not read the symbolic states, so adding --dev does not alter the result.

When self.state_repr_type == "symbolic", the agent requests symbolic state representation from the server. The presence of --dev ensures accurate information retrieval. Conversely, when self.state_repr_type != "symbolic", the agent doesn't engage with symbolic representation and requests only the images.

Regarding the presence of both self.headless_server and self.if_head: this is a leftover from our code refactoring. We are planning to integrate the Java server directly into Unity so that it can be used without additional configuration. We're committed to addressing these concerns and improving code readability in our next release.
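
To illustrate the point about the flags, here is a minimal sketch of how the command could be built from a single headless switch. It is not the current Server.py: build_server_command and its parameters are hypothetical, and it assumes, per the explanation above, that passing --dev is harmless when images are used.

```python
from typing import List


def build_server_command(headless: bool,
                         jar: str = "./game_playing_interface.jar") -> List[str]:
    """Build the java argument list for the game server (illustrative only)."""
    cmd = ["java", "-jar", jar]
    if headless:
        # No game window, e.g. on a remote machine without a display.
        cmd.append("--headless")
    # Needed for symbolic states; assumed harmless for image states,
    # so it is passed unconditionally here.
    cmd.append("--dev")
    return cmd
```

With one switch instead of two, the branches that hawe66 pointed out collapse into a single condition.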

Please let me know if you have further questions or would like more clarification.

Cheers, Cheng
