Unable to launch robot server within docker container.

szhaovas commented 5 months ago

Hello developers, thank you for maintaining robo-gym!

I have been having trouble running robo-gym inside a docker container. My goal is to run the robo-gym server side from within the container, and run the robo-gym training script on my host machine. However, I cannot seem to launch the robot server inside the container, and the application always stalls at the step Starting Robot Server....

I initially thought it was a docker port problem, but even when I launched both the server and the training script within the same docker container, I still could not launch the server, as shown (test.py in the right pane is simply the Random Agent MiR100 Simulation Environment example from the README):

Steps to reproduce

I have pushed my docker image to Docker Hub; pull it with docker pull szhaovas/robogym_test.
- This image was built from noetic.Dockerfile at my fork of the robo-gym-robot-servers repo. This dockerfile is similar to your original, except it also installs robo-gym for running the training script.
Launch a container with docker run --rm -it szhaovas/robogym_test.
Within the container terminal, run start-server-manager && attach-to-server-manager.
Right click to split the tmux pane, and in the other pane, run python3 test.py.

My setup

Probably doesn't matter, but my host machine is a MacBook Air M2 2023, and its OS is Ventura 13.5.2.
My docker version is 26.1.1.

Additional info

I think the problem is specific to docker, since I was able to run the same example in an Ubuntu 20.04 ROS Noetic virtual machine.
The line that doesn't return seems to be the self.tmux_srv.new_session line inside ServerManager.new_session().
When running kill-all-robot-servers, it returns the error message error connecting to /tmp/tmux-0/ServerManager (No such file or directory).

Thanks in advance!

jr-b-reiterer commented 5 months ago

Hi @szhaovas,

have you tried replacing gui=True with gui=False in the environment initialization?

szhaovas commented 4 months ago

Hi @jr-b-reiterer,

Thank you for the reply. Yes, I replaced gui=True with gui=False. The test.py file in the forked repo I shared above contains the test script I was running.

jr-b-reiterer commented 4 months ago

When I test with your image, the behaviour is different: I get past the lines from your screenshot, but then the reset fails. The gym warning I get there hints that you are using a version of gym that is too new, 0.26. The present version of robo-gym is compatible with gym only up to 0.21 because of gym's API change. (An upgrade of robo-gym is in the works internally.)

Back to your observation: I am using Docker 20.10.21 on Ubuntu 20.04. I am not sure whether any difference here could cause the problem. You could test whether the behaviour differs when you run your test script not in a tmux pane but in a separate terminal attached to your running container: docker exec -it <container name> bash

szhaovas commented 4 months ago

Hi @jr-b-reiterer,

I downgraded gym to 0.21, and I am now getting the same error as you. I tried both running docker exec -it <container name> bash and running the test script in a tmux pane, and in both cases, I am no longer stuck at "Starting new robot server", but get an error at reset. Do you know how I might fix the reset error? Thanks!

jr-b-reiterer commented 4 months ago

I am not sure it will fix your issue, but apparently your downgrade of gym was not successful. The passive env checker that outputs the warning in your screenshot does not exist in Gym v0.21, see https://github.com/openai/gym/blob/v0.21.0/gym/utils/passive_env_checker.py vs https://github.com/openai/gym/blob/0.26.0/gym/utils/passive_env_checker.py
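One way to confirm which gym version the interpreter inside the container actually sees is to query the installed package metadata. The following is only a minimal sketch, assuming Python 3.8 or newer; the helper names installed_version, release_tuple and is_at_most are made up for illustration, and the 0.21 limit is the one mentioned earlier in this thread.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version):
    """Turn a dotted version such as '0.21.0' into a tuple of ints, (0, 21, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_most(version, limit):
    """True if `version` does not exceed `limit`, compared at the precision of `limit`."""
    limit_parts = release_tuple(limit)
    return release_tuple(version)[: len(limit_parts)] <= limit_parts


if __name__ == "__main__":
    found = installed_version("gym")
    if found is None:
        print("gym is not installed in this interpreter")
    else:
        # 0.21 is the compatibility limit mentioned in this thread.
        print("gym", found, "compatible:", is_at_most(found, "0.21"))
```

Running this with the same python3 that runs test.py shows whether a downgrade took effect for that interpreter rather than for a different Python installation.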

szhaovas commented 3 months ago

Update: got it to work with 2 fixes!

The gym version had to be downgraded to 0.18.3 as described in a previous issue.
Some container ports had to be mapped to host ports when launching the container.
- On Linux machines, simply launch the container with the option --network host.
- (Hacky. Please let me know if anyone has a cleaner solution.) macOS didn't support host network mode, so instead I had to map ports specifically for the robot server and the server manager. What worked for me was docker run --rm -it -p 47000-47100:47000-47100 -p 50100-50200:50100-50200 <image>.
- Within the container, find the 3 instances of find_free_port() within <robogym_server_modules>/server_manager/server.py (should be on L69, L75, L78), and give each of these a lower_bound and upper_bound within the range of mapped ports. Make sure the ranges don't overlap; in my case, I had find_free_port(47000, 47030), find_free_port(47030, 47060), find_free_port(47060, 47100).

Now the robogym training script on the host machine can communicate with docker.
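For anyone unsure what restricting the port search to a mapped range looks like, here is a rough sketch in plain Python. The helper below is a hypothetical stand-in and not the robo-gym find_free_port implementation: its name, the half-open [lower_bound, upper_bound) convention, and the host argument are all assumptions made for illustration.

```python
import socket


def find_free_port_in_range(lower_bound, upper_bound, host=""):
    """Return the first port in [lower_bound, upper_bound) that can be bound.

    Hypothetical helper for illustration only; it is not the function
    shipped with robo-gym.
    """
    for port in range(lower_bound, upper_bound):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken or not permitted, try the next one
            return port
    raise RuntimeError(
        "no free port in range %d-%d" % (lower_bound, upper_bound)
    )


if __name__ == "__main__":
    # Three non-overlapping sub-ranges inside the mapped 47000-47100 block,
    # mirroring the bounds quoted above.
    for low, high in [(47000, 47030), (47030, 47060), (47060, 47100)]:
        print(low, high, "->", find_free_port_in_range(low, high))
```

Keeping every sub-range inside the block published with -p means any port the server picks is reachable from the host.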
