ros2 / ros2cli

ROS 2 command line interface tools
Apache License 2.0

ros2 topic list takes 10 minutes in a container (when run first time) #903

Open marekcygan opened 2 months ago

marekcygan commented 2 months ago

Bug report

Listing topics takes several minutes (10-20) when run for the first time from the command line. During this time one core is at 100% utilization.

Required Info:

Steps to reproduce issue

docker run -it osrf/ros:rolling-desktop

(inside docker)

ros2 topic list

Expected behavior

Topics listed after a second.

Actual behavior

Command takes 10 minutes.

MichaelOrlov commented 2 months ago

Hi @marekcygan, we discussed this issue in our weekly waffle triage meeting, and here is the outcome.

  1. We need more information about the storage driver and about the host system itself.
  2. Could you please try to narrow down the problem by running with --no-daemon? If the issue goes away, it is likely related to the daemon collecting node graph information about other nodes.
  3. This issue might be related to the fix (patch) for the "Meltdown" vulnerability. There were patches in the stdlib and the Linux kernel for it, and we have seen significant performance degradation on some other platforms.
marekcygan commented 2 months ago

@MichaelOrlov thanks for your attention!

  1. Storage driver: overlay2 (putting full docker info in a separate comment). Host system:

    ❯ uname -r
    6.8.5-1-MANJARO
  2. I got an error when adding --no-daemon to ros2 topic list:

    root@b18db09cbf50:/# ros2 topic list --no-daemon
    Operation not permitted
    terminate called after throwing an instance of 'std::system_error'
    what():  Invalid argument
    Aborted (core dumped)
marekcygan commented 2 months ago
❯ docker info
Client:
 Version:    25.0.3
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  0.13.0
    Path:     /usr/lib/docker/cli-plugins/docker-buildx
WARNING: Plugin "/home/marek/.docker/cli-plugins/docker-compose" is not valid: failed to fetch metadata: exit status 255

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 27
 Server Version: 25.0.3
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: 7c3aca7a610df76212171d200ca3811ff6096eb8.m
 runc version: 
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.5-1-MANJARO
 Operating System: Manjaro Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.72GiB
 Name: marek-rtx-3070
 ID: B4RK:DIMQ:CKKZ:HOQD:QE3R:4IO6:V6GJ:LCPT:2ORO:W5I7:WM37:SM35
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
marekcygan commented 1 month ago

@MichaelOrlov any ideas what should be the next steps?

MichaelOrlov commented 1 month ago

@nuclearsandwich @wjwwood Any thoughts now that details about the running system configuration have been provided?

fujitatomoya commented 1 month ago

@marekcygan

so after,

docker run -it osrf/ros:rolling-desktop

this just hangs up forever,

ros2 topic list

but this generates the permission error?

root@b18db09cbf50:/# ros2 topic list --no-daemon
Operation not permitted
terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument
Aborted (core dumped)

that is really weird, and probably not related to ROS2...

a couple of things i would check,

### Did you source the ROS2 environment
root@cc8f329115ab:/# source /opt/ros/rolling/setup.bash 

### Check the file and id ownership and permission
root@cc8f329115ab:/# which ros2
/opt/ros/rolling/bin/ros2
root@cc8f329115ab:/# ls -l /opt/ros/rolling/bin/ros2
-rwxr-xr-x 1 root root 955 Feb 16 16:37 /opt/ros/rolling/bin/ros2
root@cc8f329115ab:/# id -a
uid=0(root) gid=0(root) groups=0(root)
marekcygan commented 1 month ago

@fujitatomoya @MichaelOrlov

> @marekcygan
>
> so after,
>
> docker run -it osrf/ros:rolling-desktop
>
> this just hangs up forever,
>
> ros2 topic list

Not forever, it takes 10 minutes to finish.

root@dd4078a0cd6b:/# time ros2 topic list
/parameter_events
/rosout

real    11m27.017s
user    9m53.996s
sys     1m32.153s

> but this generates the permission error?

It used to, but now it does not, now it prints what it should immediately:

root@dd4078a0cd6b:/# ros2 topic list --no-daemon
/parameter_events
/rosout

> a couple of things i would check,
> Did you source the ROS2 environment

Yes, otherwise I would not be able to run ros2 topic.

> Check the file and id ownership and permission
> root@cc8f329115ab:/# which ros2

I get:

/opt/ros/rolling/bin/ros2

> root@cc8f329115ab:/# ls -l /opt/ros/rolling/bin/ros2

-rwxr-xr-x 1 root root 955 Feb 16 16:37 /opt/ros/rolling/bin/ros2

> root@cc8f329115ab:/# id -a

uid=0(root) gid=0(root) groups=0(root)
fujitatomoya commented 1 month ago

> Not forever, it takes 10 minutes to finish.

can you stop the container, start it up again, and then try the following?

### login container and then

### check if ros2 daemon is running, expecting not running
ros2 daemon status

### ros2 command, expecting this takes 10 mins
ros2 topic list

### see if ros2 daemon is now running
ros2 daemon status

### ros2 command again, to tell whether the problem is the daemon spawning process or the XMLRPC traffic
ros2 topic list

if the 2nd ros2 topic list responds quickly, the problem is probably in spawning the ros2 daemon process on your platform.
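The comparison between the first and second invocation can also be scripted. The following is a minimal sketch, not part of ros2cli: `time_command` and `compare_runs` are hypothetical helper names, and the code only assumes that the command under test is available on PATH.

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd once and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    result = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start, result.returncode


def compare_runs(cmd, runs=2):
    """Run cmd several times in a row and collect one timing per run."""
    return [time_command(cmd) for _ in range(runs)]
```

Inside the container, calling `compare_runs(["ros2", "topic", "list"])` right after startup would give one timing for the run that has to spawn the daemon and one for a run where the daemon already exists. A large first value next to a small second value points at daemon startup rather than at the XMLRPC traffic.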

> It used to, but now it does not, now it prints what it should immediately:

at least, this is a relief. thanks for checking.

marekcygan commented 1 month ago

One more piece of information is that I have updated all the manjaro packages last week.

❯ docker --version
Docker version 26.1.1, build 4cf5afaefa
❯ uname -r
6.8.9-3-MANJARO
sgvandijk commented 1 month ago

I also run into this issue on Manjaro. After digging a little bit I found that it gets stuck in this loop: https://github.com/ros2/ros2cli/blob/58b61c98378fa49a4a164450f1d5222bde2e4f50/ros2cli/ros2cli/node/daemon.py#L140-L149

On my system, resource.getrlimit(resource.RLIMIT_NOFILE) returns 1073741816 and it takes a long time to count that high!

However, it looks like a workaround for this has already been created here: https://github.com/ros2/ros2cli/commit/64d216cb8fafef83d046b79ee6294afb06b7c595 which made it into Jazzy.

It would be great if that could be backported to Humble and Iron!
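To illustrate why the size of the limit matters, here is a minimal sketch of the two ways a daemonizing process can decide which descriptors to close. It is not the code from the linked commit; `nofile_limits`, `open_fds`, and `fds_to_close` are hypothetical names, and the `/proc/self/fd` path is a Linux-specific assumption.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE pair for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fds(fd_dir="/proc/self/fd"):
    """Return the set of descriptors listed as open for this process.

    The directory listing itself briefly holds a descriptor, so the set may
    contain one number that is already closed by the time it is returned.
    """
    return {int(name) for name in os.listdir(fd_dir) if name.isdigit()}


def fds_to_close(keep=(0, 1, 2)):
    """Descriptors to close, found without counting up to the limit."""
    return sorted(fd for fd in open_fds() if fd not in keep)


soft, hard = nofile_limits()
print("soft limit:", soft, "hard limit:", hard)
print("descriptors actually open:", len(open_fds()))
```

A loop over `range(limit)` does work proportional to the limit, which is what becomes slow when the limit is around a billion. Listing the open descriptors does work proportional to how many are really open, typically a handful.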

fujitatomoya commented 1 month ago

@sgvandijk thanks for posting the information, i was aware of that issue.

> docker run -it osrf/ros:rolling-desktop

the original post tells me this happens with rolling, so it could be another issue, because https://github.com/ros2/ros2cli/pull/888 is already available in rolling and jazzy.

> It would be great if that could be backported to Humble and Iron!

no objections to this.

MirTITH commented 1 month ago

As a workaround, you can add --ulimit nofile=1024:1048576 to the docker run command:

docker run -it --ulimit nofile=1024:1048576 my-image

Or set default ulimits in /etc/docker/daemon.json:

{
    "default-ulimits": {
        "nofile": {
            "Name": "nofile",
            "Hard": 1048576,
            "Soft": 1024
        }
    }
}

Then restart docker daemon:

sudo systemctl restart docker

Note: These values are based on Ubuntu 22.04.
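If /etc/docker/daemon.json already contains other settings (for example a default runtime), the ulimit entry has to be merged in, not written over them. The following is a minimal sketch of that merge, assuming the file has already been parsed into a dict; `with_nofile_ulimit` is a hypothetical helper and the `existing` dict is made-up example input.

```python
import json


def with_nofile_ulimit(config, soft=1024, hard=1048576):
    """Return a copy of a daemon.json dict with a default nofile ulimit set.

    Other top-level keys and other default ulimits are preserved, and the
    input dict is left unmodified.
    """
    merged = dict(config)
    ulimits = dict(merged.get("default-ulimits", {}))
    ulimits["nofile"] = {"Name": "nofile", "Soft": soft, "Hard": hard}
    merged["default-ulimits"] = ulimits
    return merged


# Hypothetical existing configuration, for illustration only.
existing = {"default-runtime": "nvidia"}
print(json.dumps(with_nofile_ulimit(existing), indent=4))
```

The printed JSON can then be reviewed and written back to the file before restarting the docker daemon.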

fujitatomoya commented 1 month ago

> It would be great if that could be backported to Humble and Iron!

backports to humble and iron are completed.

i still need to keep this open since the original issue came from rolling, where we should not hit this problem because https://github.com/ros2/ros2cli/commit/64d216cb8fafef83d046b79ee6294afb06b7c595 has been in rolling for a while.

@marekcygan can you confirm?

MarinoAlpine commented 2 weeks ago

Hello,

I am encountering the same issue with my devcontainer.

I am using ROS 2 with the Intel RealSense ROS 2 wrapper for their depth cameras. I have also run the container without the RealSense wrapper and I still get the problem.

Thanks for your help!

marekcygan commented 1 week ago

The issue no longer exists on my end. Sorry for late reply.

NovoG93 commented 1 week ago

I am facing similar issues on Fedora 40. List commands (e.g. ros2 topic list, ros2 node list) do not terminate. They finish only when run with --no-daemon.