minerllabs / minerl

MineRL Competition for Sample Efficient Reinforcement Learning - Python Package
http://minerl.io/docs/

Socket timed out after ~1hour #682

Open kolbytn opened 2 years ago

kolbytn commented 2 years ago

I don't have good reproduction steps yet, but I get the following error after roughly 0.5 to 1 hours of running MineRL. This occurs both on my local machine and in the provided Docker container. I'm using minerl.herobraine.env_specs.human_survival_specs.HumanSurvival. Is this a known issue?

Traceback (most recent call last):
...
  File "/home/knotting/mine-goals/venv/lib/python3.9/site-packages/gym/core.py", line 251, in reset
    return self.env.reset(**kwargs)
  File "/home/knotting/mine-goals/venv/lib/python3.9/site-packages/minerl/env/_singleagent.py", line 22, in reset
    multi_obs = super().reset()
  File "/home/knotting/mine-goals/venv/lib/python3.9/site-packages/minerl/env/_multiagent.py", line 446, in reset
    self._send_mission(self.instances[0], agent_xmls[0], self._get_token(0, ep_uid))  # Master
  File "/home/knotting/mine-goals/venv/lib/python3.9/site-packages/minerl/env/_multiagent.py", line 605, in _send_mission
    reply = comms.recv_message(instance.client_socket)
  File "/home/knotting/mine-goals/venv/lib/python3.9/site-packages/minerl/env/comms.py", line 63, in recv_message
    lengthbuf = recvall(sock, 4)
  File "/home/knotting/mine-goals/venv/lib/python3.9/site-packages/minerl/env/comms.py", line 73, in recvall
    newbuf = sock.recv(count)
socket.timeout: timed out
Miffyli commented 2 years ago

Hmm, MineRL is not the most stable thing ever and occasional crashes are to be expected, but they should not be that common. Have you checked the memory usage? Calling reset frequently sometimes causes a memory leak. Getting MineRL's own error message would help. You can enable more verbose logging by putting this at the beginning of your code:

import logging
logging.basicConfig(level=logging.DEBUG)
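
Note that logging.basicConfig, when given neither a stream nor a filename, writes to stderr rather than stdout, so the debug records may not show up if only stdout is being captured. A minimal sketch of a variant that sends them to a file or to stdout instead; the helper name configure_debug_logging and the log path are illustrative and not part of MineRL, and force=True assumes Python 3.8 or newer:

```python
import logging
import sys


def configure_debug_logging(log_path=None):
    """Send DEBUG-level records to a file if log_path is given, else to stdout.

    Hypothetical helper for illustration; not part of MineRL.
    """
    if log_path is not None:
        # filename= installs a FileHandler instead of the default stderr handler.
        logging.basicConfig(level=logging.DEBUG, filename=log_path, force=True)
    else:
        # stream= redirects the default handler from stderr to stdout.
        logging.basicConfig(level=logging.DEBUG, stream=sys.stdout, force=True)
```

Calling configure_debug_logging("minerl_debug.log") before creating the environment should leave a file that can be attached to the issue. Whether MineRL's Java side reports anything useful through Python logging is not confirmed in this thread.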
kolbytn commented 1 year ago

I'm taking a look at this again. I wasn't able to get any additional logs by changing the logging level to debug. Should those be coming from stdout?

There's definitely a memory leak, but memory usage doesn't appear to spike or approach the system's maximum before the crash. I did, however, increase my episode length from 500 to 5000, which appears to have fixed the problem. I can make this work for now.
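
For anyone who cannot lengthen their episodes: since the traceback shows the timeout being raised inside reset, one possible workaround is to catch socket.timeout around reset and rebuild the environment. The following is only a sketch under assumptions: reset_with_retry and make_env are hypothetical names, make_env stands for however you construct your environment, and it is not confirmed here that a MineRL environment closes cleanly after a timeout.

```python
import socket


def reset_with_retry(env, make_env, max_retries=3):
    """Reset env; on socket.timeout, close it and retry with a fresh one.

    make_env is a hypothetical zero-argument factory that builds a new
    environment; it is not part of MineRL itself.
    Returns (env, obs), because the environment object may have been replaced.
    """
    for attempt in range(max_retries + 1):
        try:
            return env, env.reset()
        except socket.timeout:
            if attempt == max_retries:
                # Out of retries: let the caller see the original error.
                raise
            try:
                env.close()
            except Exception:
                # Assumption: a failed close is ignored so we can still rebuild.
                pass
            env = make_env()
```

The caller would use it as env, obs = reset_with_retry(env, make_env) in place of obs = env.reset(). This does not address the underlying memory leak; it only keeps a long run from dying on a single failed reset.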