tkn-tub / ns3-gym

ns3-gym - The Playground for Reinforcement Learning in Networking Research
GNU General Public License v2.0
python socket recv hanging #25

Open anita-hu opened 4 years ago

anita-hu commented 4 years ago

I wrote a custom environment following your examples. During training, the code sometimes gets stuck at env.step(action). Judging from the traceback after a keyboard interrupt, it looks like an issue with the zmq socket. Is there a way to resolve this so that training does not get stuck?

^CTraceback (most recent call last):
  File "./train.py", line 22, in <module>
    ddpg.train(max_epochs=2000)
  File "/home/sim-user/ns3-gym/scratch/mm1-queue/ddpg_agent.py", line 197, in train
    next_state, reward, done, info = self.env.step(action) # perform action on env
  File "/home/sim-user/.local/lib/python3.5/site-packages/ns3gym/ns3env.py", line 401, in step
    response = self.ns3ZmqBridge.step(action)
  File "/home/sim-user/.local/lib/python3.5/site-packages/ns3gym/ns3env.py", line 231, in step
    self.rx_env_state()
  File "/home/sim-user/.local/lib/python3.5/site-packages/ns3gym/ns3env.py", line 180, in rx_env_state
    request = self.socket.recv()
  File "zmq/backend/cython/socket.pyx", line 791, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 827, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 186, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 12, in zmq.backend.cython.checkrc._check_rc
KeyboardInterrupt

anita-hu commented 4 years ago

The issue was resolved by setting debug=True in ns3env.Ns3Env(). It would be great if this could be fixed.

chwang1996 commented 4 years ago

I have the same problem when running training on my custom environment. Thanks for your solution; I hope this issue gets resolved.

confifu commented 3 years ago

Setting debug=True does not solve the problem entirely: it shows the error from the ns3 simulation script, but the process still hangs there forever. To handle runtime errors in the simulation script, I added self.socket.RCVTIMEO = 100000 right next to this line: https://github.com/tkn-tub/ns3-gym/blob/19bfe0a583e641142609939a090a09dfc63a095f/src/opengym/model/ns3gym/ns3gym/ns3env.py#L40

This makes sure that the socket receive does not block forever and instead times out after 100000 milliseconds (100 seconds). The receive call is here: https://github.com/tkn-tub/ns3-gym/blob/19bfe0a583e641142609939a090a09dfc63a095f/src/opengym/model/ns3gym/ns3gym/ns3env.py#L180
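
For anyone who wants to see the timeout behaviour in isolation, here is a minimal standalone pyzmq sketch. It is not code from ns3-gym. The helper name recv_with_timeout and the 200 ms value are made up for illustration, and the REP socket is bound to a random local port with no peer, so nothing ever arrives. With RCVTIMEO set, recv() raises zmq.Again once the timeout expires instead of blocking indefinitely.

```python
import zmq


def recv_with_timeout(socket, timeout_ms):
    """Hypothetical helper: return the next message, or None on timeout."""
    # RCVTIMEO is given in milliseconds; pyzmq also accepts the attribute
    # form used in the comment above (socket.RCVTIMEO = timeout_ms).
    socket.setsockopt(zmq.RCVTIMEO, timeout_ms)
    try:
        return socket.recv()
    except zmq.Again:
        # Raised when no message arrived before the timeout expired.
        return None


context = zmq.Context()
socket = context.socket(zmq.REP)
# No client ever connects to this port, so recv() can only time out.
port = socket.bind_to_random_port("tcp://127.0.0.1")

reply = recv_with_timeout(socket, 200)
print("timed out" if reply is None else reply)

socket.close(linger=0)
context.term()
```

In a training loop, the caller would still have to decide what to do on a timeout (for example, abort the episode and restart the simulation). That part depends on the setup and is not shown here.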

tim00631 commented 2 years ago

@confifu I have the same issue on Ubuntu 18.04. However, when I run my script as the root user, the problem doesn't happen. Maybe you could try running the simulation script again as root.