Closing the terminal doesn't kill the node in ROS2 Humble running on Docker #721

Closed audrow closed 7 months ago

audrow commented 7 months ago

Copied from issue on ros2/ros2 by @maxkonrad

Also see @fujitatomoya's comment.

Bug report

Required Info:

Steps to reproduce issue

1- I connected to the Jetson Nano host via SSH from the Terminator terminal on my Ubuntu 22.04 PC.

2- I ran a Docker container built from the following Dockerfile:

RUN apt-get update && apt-get install -y nano && rm -rf /var/lib/apt/lists/*

COPY config/ /site_config/


RUN groupadd --gid $USER_GID $USERNAME \
  && useradd -s /bin/bash --uid $USER_UID --gid $USER_GID -m $USERNAME \
  && mkdir /home/$USERNAME/.config && chown $USER_UID:$USER_GID /home/$USERNAME/.config

RUN apt-get update \
  && apt-get install -y sudo \
  && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME\
  && chmod 0440 /etc/sudoers.d/$USERNAME \
  && rm -rf /var/lib/apt/lists/*

COPY entrypoint.sh /entrypoint.sh
COPY bashrc /home/$USERNAME/.bashrc

COPY /my_py_pkg /src/my_py_pkg

ENTRYPOINT [ "/bin/bash", "/entrypoint.sh" ]

CMD ["bash"] 

3- The my_py_pkg Python package contains two std_msgs.msg Int64 publishers. One of them publishes to the /number_count topic, and both of them publish to the /number topic. (I don't know whether two nodes publishing to one topic is a problem.)

4- Close the terminal or change the network.

Expected behavior

I expected the running nodes to be killed.

Actual behavior

The nodes still show up when I run ros2 node list, but when I run ros2 lifecycle set <topic name> shutdown it returns Node not found in the terminal. I don't know whether the nodes are alive or not.

Additional information

Screenshot from 2024-02-13 14-24-03

audrow commented 7 months ago

Also, from some discussion at our weekly triage meeting, it seems like the issue may be in how signals are handled by the entry point.

mikaelarguedas commented 7 months ago

Thanks for reporting. @maxkonrad @audrow

"ros:humble repo (using FROM ros:humble command in Dockerfile)"

"it seems like the issue may be in how signals are handled by the entry point."

Could you provide a reproducible example, including the full Dockerfile and the other files copied into the container (e.g. entrypoint.sh)?

Can you reproduce the issue with a vanilla ros:humble image, without extra custom configs and files?

maxkonrad commented 7 months ago

I will try to reproduce with both again today and share the process I followed. Sorry for the late answer, I was busy these days.

fujitatomoya commented 7 months ago

First of all, is the container still in the running state after you close the terminal? In other words, how did you start the container (e.g. docker run xxx)? Can you provide all the options? If the container is daemonized, it should keep running the application after the SSH session is killed.

A couple more questions.


maxkonrad commented 7 months ago

entrypoint.sh:

set -e

source /opt/ros/humble/setup.bash

echo "Provided arguments: $@"

exec $@
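
A side note on the entrypoint script above: it forwards its arguments with an unquoted exec $@. The sketch below is not part of the reported setup. It only illustrates two properties of that line in plain shell: exec replaces the shell, so the command keeps the same PID and receives signals directly, and quoting "$@" preserves arguments that contain spaces. The helper names count_args, forward_unquoted, forward_quoted, and exec_pids are hypothetical.

```shell
# Illustrative helpers (hypothetical names, not from the original entrypoint).

# Print how many arguments were received.
count_args() { echo "$#"; }

# Forward arguments unquoted, as the entrypoint does:
# each argument is split again on whitespace.
forward_unquoted() { count_args $@; }

# Forward arguments quoted: each one is passed through unchanged.
forward_quoted() { count_args "$@"; }

# exec replaces the current shell instead of forking a child,
# so the two echo commands below report the same PID.
exec_pids() { sh -c 'echo $$; exec sh -c "echo \$\$"'; }
```

For the CMD ["bash"] default used here the two forwarding styles behave the same. The difference only shows up once an argument contains whitespace or glob characters.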


bashrc:

source /opt/ros/humble/setup.bash
source /usr/share/colcon_argcomplete/hook/colcon-argcomplete.bash


Dockerfile:

FROM osrf/ros:humble-desktop-full

RUN apt-get update && apt-get install -y nano && rm -rf /var/lib/apt/lists/*

COPY config/ /site_config/


# Creating a non-root user
RUN groupadd --gid $USER_GID $USERNAME \
  && useradd -s /bin/bash --uid $USER_UID --gid $USER_GID -m $USERNAME \
  && mkdir /home/$USERNAME/.config && chown $USER_UID:$USER_GID /home/$USERNAME/.config

# Set-up sudo
RUN apt-get update \
  && apt-get install -y sudo \
  && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME\
  && chmod 0440 /etc/sudoers.d/$USERNAME \
  && rm -rf /var/lib/apt/lists/*

COPY entrypoint.sh /entrypoint.sh
COPY bashrc /home/$USERNAME/.bashrc

COPY /my_py_pkg /src/my_py_pkg
ENTRYPOINT [ "/bin/bash", "/entrypoint.sh" ]

CMD ["bash"]

my_py_pkg simply contains basic number publisher and subscriber scripts to test the connection.

build command: sudo docker image build -t jetson_docker .

run command: sudo docker run -it --user ros --network=host --ipc=host -v $PWD/source:/my_py_pkg jetson_docker

Important: I was connected to the Jetson via SSH and closed the terminal on my host.

I will try to reproduce the issue again with the ros:humble image in a few minutes.

maxkonrad commented 7 months ago

docker running code: sudo docker run -it --user ros --network=host --ipc=host -v $PWD/source:/my_py_pkg <img_name>

yes they are on the same network

I will try again today to reproduce the issue. As you said, maybe it takes time to un-discover because of the SSH connection or Docker?

maxkonrad commented 7 months ago

I quickly prepared a video for this: link to YouTube video

fujitatomoya commented 7 months ago

maybe it takes time to un-discover because of ssh connection or docker??

Besides this, can you check the container status with docker ps -a? I think the container is supposed to be in the exited state after the terminal is closed.
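
The status check can be scripted. This is a minimal sketch, assuming the STATUS text printed by docker ps -a starts with "Up" for a running container and "Exited" for a stopped one. The function name container_state is hypothetical.

```shell
# Classify one STATUS string as printed by `docker ps -a`.
# Hypothetical helper; anything that is neither "Up..." nor "Exited..."
# (for example "Created") is reported as "other".
container_state() {
  case "$1" in
    Up*)     echo running ;;
    Exited*) echo exited ;;
    *)       echo other ;;
  esac
}

# Possible usage on the Docker host (image name taken from the build command in this thread):
#   container_state "$(docker ps -a --filter ancestor=jetson_docker --format '{{.Status}}' | head -n 1)"
```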

maxkonrad commented 7 months ago

No, actually I only closed one terminal session of the container, which I had created with the docker exec command. The Docker container still runs. @fujitatomoya

maxkonrad commented 7 months ago

I also just realized that I wasn't using OSRF's desktop image on the Jetson (besides, there is no ARM image for the OSRF ROS 2 desktop, afaik). By mistake I copied the wrong code from a private repo. Only the FROM line should change: it should be FROM ros:humble. I know that makes this issue unrelated to osrf and related to ros instead, so maybe I should move this issue again. Sorry again for the mistake. @audrow

After the correction, the Dockerfile should be as follows:

FROM ros:humble

RUN apt-get update && apt-get install -y nano && rm -rf /var/lib/apt/lists/*

COPY config/ /site_config/


# Creating a non-root user
RUN groupadd --gid $USER_GID $USERNAME \
  && useradd -s /bin/bash --uid $USER_UID --gid $USER_GID -m $USERNAME \
  && mkdir /home/$USERNAME/.config && chown $USER_UID:$USER_GID /home/$USERNAME/.config

# Set-up sudo
RUN apt-get update \
  && apt-get install -y sudo \
  && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME\
  && chmod 0440 /etc/sudoers.d/$USERNAME \
  && rm -rf /var/lib/apt/lists/*

COPY entrypoint.sh /entrypoint.sh
COPY bashrc /home/$USERNAME/.bashrc

ENTRYPOINT [ "/bin/bash", "/entrypoint.sh" ]

CMD ["bash"]

Sorry again for the mistake, I am new to the software and open source world :(

fujitatomoya commented 7 months ago

I can reproduce this issue in my environment without ROS. I would say the current workaround is to make sure to exit the process spawned by docker exec before closing the terminal. That said, this is an issue with Docker, not ROS.

### start container
tomoyafujita@~/DVT/work >docker run -it --network=host --ipc=host test
Provided arguments: bash
root@tomoyafujita:/# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 22:30 pts/0    00:00:00 bash
root          46       1  0 22:30 pts/0    00:00:00 ps -ef

### start another session
tomoyafujita@~/DVT >docker exec -it c1922deefec2 /bin/bash
root@tomoyafujita:/# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 22:30 pts/0    00:00:00 bash
root          47       0  0 22:31 pts/1    00:00:00 /bin/bash
root          54      47  0 22:31 pts/1    00:00:00 ps -ef

### closing terminal without exit
root@tomoyafujita:/# sleep 60

root@tomoyafujita:/# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 22:30 pts/0    00:00:00 bash
root          47       0  0 22:31 pts/1    00:00:00 /bin/bash
root          56      47  0 22:32 pts/1    00:00:00 sleep 60
root          57       1  0 22:32 pts/0    00:00:00 ps -ef

### give it 60 seconds
root@tomoyafujita:/# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 22:30 pts/0    00:00:00 bash
root          47       0  0 22:31 pts/1    00:00:00 /bin/bash
root          59       1  0 22:32 pts/0    00:00:00 ps -ef

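The problem is PID 47, the /bin/bash started by docker exec, which is still alive after the terminal is closed. That is why its child process sleep (this could just as well be a ros2 command) stayed alive for the full 60 seconds, until it expired on its own.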

https://github.com/moby/moby/issues/9098 seems related.
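
In the listings above, the shell started by docker exec (PID 47) has PPID 0 without being PID 1. Based on that observation, the following sketch picks such leftover sessions out of ps -ef output. The function name list_exec_sessions is hypothetical, and the PPID 0 heuristic is an assumption drawn from this one listing rather than documented Docker behavior.

```shell
# Read `ps -ef` output on stdin and print "PID CMD" for every process
# whose parent PID is 0 but which is not the container's PID 1.
# Rows for `ps` itself are skipped, since `docker exec <container> ps -ef`
# would otherwise list its own process.
list_exec_sessions() {
  awk 'NR > 1 && $3 == 0 && $2 != 1 && $8 != "ps" {
    line = $2
    for (i = 8; i <= NF; i++) line = line " " $i
    print line
  }'
}
```

Run from a shell inside the container as ps -ef | list_exec_sessions. Any PID it prints belongs to a session that can then be stopped with kill before or after the terminal is closed.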

maxkonrad commented 7 months ago

@fujitatomoya thanks so much, I think mods can close this issue then.

tfoote commented 7 months ago

Yeah, looks like an upstream issue with exec. Closing here.