jr-robotics / robo-gym

An open source toolkit for Distributed Deep Reinforcement Learning on real and simulated robots.
https://sites.google.com/view/robo-gym
MIT License

Unable to launch robot server within docker container. #81

Open szhaovas opened 5 months ago

szhaovas commented 5 months ago

Hello developers, thank you for maintaining robo-gym!

I have been having trouble running robo-gym inside a Docker container. My goal is to run the robo-gym server side inside the container and the robo-gym training script on my host machine. However, I cannot launch the robot server inside the container: the application always stalls at the step "Starting Robot Server...".

I initially thought it was a Docker port problem, but even when I launched both the server and the training script within the same Docker container, the server still would not launch, as shown (test.py in the right pane is simply the Random Agent MiR100 Simulation Environment example from the README):

Screenshot 2024-06-07 at 15 24 19

Steps to reproduce

My setup

Additional info

Thanks in advance!

jr-b-reiterer commented 5 months ago

Hi @szhaovas,

have you tried replacing gui=True with gui=False in the environment initialization?

szhaovas commented 4 months ago

Hi @jr-b-reiterer,

Thank you for the reply. Yes, I replaced gui=True with gui=False. The test.py file in the forked repo I shared above contains the test script I was running.

jr-b-reiterer commented 4 months ago

When I test with your image, the behaviour is different: I get past the lines from your screenshot, but then the reset fails. The warning I get from gym at that point suggests that your version of gym, 0.26, is too new. The present version of robo-gym is compatible only with gym up to 0.21 because of gym's API change. (An upgrade of robo-gym is in the works internally.)

Back to your observation: I am using Docker 20.10.21 on Ubuntu 20.04. I am not sure whether any difference here could cause the problem. You could test whether the behaviour changes when you run your test script not in a tmux pane but in a separate terminal attached to your running container: docker exec -it <container name> bash

szhaovas commented 4 months ago

Hi @jr-b-reiterer,

I downgraded gym to 0.21, and I am now getting the same error as you. I tried both running docker exec -it <container name> bash and running the test script in a tmux pane, and in both cases, I am no longer stuck at "Starting new robot server", but get an error at reset. Do you know how I might fix the reset error? Thanks!

Screenshot 2024-07-02 at 15 55 56

jr-b-reiterer commented 4 months ago

I am not sure it will fix your issue, but apparently your downgrade of gym was not successful. The passive env checker that outputs the warning in your screenshot does not exist in Gym v0.21, see https://github.com/openai/gym/blob/v0.21.0/gym/utils/passive_env_checker.py vs https://github.com/openai/gym/blob/0.26.0/gym/utils/passive_env_checker.py
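For anyone who wants to verify which gym is actually active inside the container, here is a minimal sketch. It reads the installed version and checks whether the passive env checker module mentioned above can be found. The helper names (parse_version, gym_is_old_enough, has_passive_env_checker) are made up for this illustration and are not part of gym or robo-gym, and the default limit of 0.21 is simply the compatibility limit stated earlier in this thread.

```python
import importlib.util
import re
from importlib import metadata


def parse_version(text):
    """Turn a version string such as '0.21.0' into a tuple of ints, e.g. (0, 21, 0).

    Only the leading digits of each dot-separated piece are used,
    so '0.26.0rc1' becomes (0, 26, 0).
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def gym_is_old_enough(version_text, newest_supported=(0, 21)):
    """Return True if major.minor is not newer than `newest_supported`.

    The default (0, 21) is an assumption taken from this thread.
    """
    return parse_version(version_text)[:2] <= newest_supported


def has_passive_env_checker():
    """Return True if gym.utils.passive_env_checker can be located.

    Returns False if gym is missing or fails to import.
    """
    try:
        spec = importlib.util.find_spec("gym.utils.passive_env_checker")
    except ImportError:
        return False
    return spec is not None


if __name__ == "__main__":
    try:
        installed = metadata.version("gym")
    except metadata.PackageNotFoundError:
        print("gym is not installed in this Python environment")
    else:
        print("gym version:", installed)
        print("old enough for robo-gym:", gym_is_old_enough(installed))
        print("passive env checker present:", has_passive_env_checker())
```

Run it with the same interpreter that runs the test script. If the passive env checker is reported as present, the downgrade most likely did not reach that environment.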

szhaovas commented 3 months ago

Update: got it to work with 2 fixes!

  1. The gym version had to be downgraded to 0.18.3 as described in a previous issue.
  2. Some container ports had to be mapped to host ports when launching the container.
     - On Linux machines, simply launch container with option --network host
     - (Hacky. Please let me know if anyone has a cleaner solution) MacOS didn't support host network mode, so instead I had to map ports specifically for robot server and server manager. What worked for me was docker run --rm -it -p 47000-47100:47000-47100 -p 50100-50200:50100-50200 <image>.
       - Within the container, find 3 instances of find_free_port() within <robogym_server_modules>/server_manager/server.py (should be on L69, L75, L78), and give each of these a lower_bound and upper_bound within the range of mapped ports. Make sure they don't overlap, so in my case, I had find_free_port(47000, 47030), find_free_port(47030, 47060), find_free_port(47060, 47100).

Now the robogym training script on the host machine can communicate with docker:
Screenshot 2024-08-15 at 18 12 07
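
For anyone repeating the port-mapping fix above, the sub-ranges can be computed rather than picked by hand. The sketch below is only an illustration: split_port_range and first_bindable_port are hypothetical helpers, not part of robo-gym, and they do not call or reimplement find_free_port(). Because it is not stated here whether find_free_port() treats its bounds as inclusive, the helper returns strictly disjoint inclusive ranges so that no port number appears in two of them.

```python
import socket


def split_port_range(first, last, parts):
    """Split the inclusive range [first, last] into `parts` disjoint inclusive sub-ranges."""
    total = last - first + 1
    if parts <= 0 or total < parts:
        raise ValueError("need at least one port per sub-range")
    base, extra = divmod(total, parts)
    ranges = []
    start = first
    for index in range(parts):
        # The first `extra` sub-ranges get one additional port each.
        size = base + (1 if index < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def first_bindable_port(lower, upper, host="127.0.0.1"):
    """Return the first port in [lower, upper] that can be bound on `host`, or None."""
    for port in range(lower, upper + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port in use or not permitted, try the next one
            return port
    return None


if __name__ == "__main__":
    # 47000-47100 is the first range mapped in the docker run command above.
    for lower, upper in split_port_range(47000, 47100, 3):
        print(lower, upper, "first bindable:", first_bindable_port(lower, upper))
```

The three printed pairs can then be used as the lower and upper bounds for the three find_free_port() calls, keeping every bound inside the mapped 47000-47100 range.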