ros2 / ros2cli

ROS 2 command line interface tools
Apache License 2.0

Nodes missing from `ros2 node list` after relaunch #582

Closed nielsvd closed 1 year ago

nielsvd commented 3 years ago

Bug report

Required Info:

Steps to reproduce issue

Step 1

From the workspace root, launch (e.g.) a TurtleBot3 simulation:

export TURTLEBOT3_MODEL=burger
export GAZEBO_MODEL_PATH=$GAZEBO_MODEL_PATH:$(pwd)/src/turtlebot3/turtlebot3_simulations/turtlebot3_gazebo/models
ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py

Then, in a second terminal, launch the navigation:

export TURTLEBOT3_MODEL=burger
ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true

Print the node list:

ros2 node list

Close (ctrl-c) the navigation and the simulation.

Step 2

From the same respective terminals, relaunch the simulation:

ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py

and the navigation:

ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true

Print the node list again (2nd time):

ros2 node list

Close (ctrl-c) the navigation and the simulation. Stop the ros2 daemon:

ros2 daemon stop

Step 3

From the same respective terminals, relaunch the simulation:

ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py

and the navigation:

ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true

Print the node list again (3rd time):

ros2 node list

Expected behavior

The node list should be the same all three times (up to some hash in the /transform_listener_impl_... nodes).

Actual behavior

The second time, the following nodes are missing (the remainder is practically the same):

/controller_server
/controller_server_rclcpp_node
/global_costmap/global_costmap
/global_costmap/global_costmap_rclcpp_node
/global_costmap_client
/local_costmap/local_costmap
/local_costmap/local_costmap_rclcpp_node
/local_costmap_client
/planner_server
/planner_server_rclcpp_node

The third time, after stopping the daemon, it works as expected again.

Note that everything else works fine; in the navigation use case above, the missing nodes are fully functional.

Additional information

This issue was raised here: ros-planning/navigation2#2145.

v-lopez commented 3 years ago

I'm seeing something similar with gazebo + ros2_control as well.

The interesting thing is that if I run ros2 node list, I get 0 nodes.

If I do ros2 node list --no-daemon I get the list of nodes.

Restarting the daemon with ros2 daemon stop; ros2 daemon start also shows all nodes.

fujitatomoya commented 3 years ago

I think this is expected behavior for the ros2 daemon; it is well described in what-is-ros2-daemon.
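
For anyone following along: the daemon is itself a hidden ROS 2 node that caches the graph and answers CLI queries. A minimal sketch of how to make it visible and check whether it is running (ros2 node list -a is used later in this thread; ros2 daemon status is the standard way to ask whether the daemon is up):

# The daemon shows up as a hidden node when hidden nodes are included:
ros2 node list -a
# Check whether the daemon is currently running:
ros2 daemon status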

v-lopez commented 3 years ago

Is it? I understood it as a cache of nodes and their subs/pubs/services, etc., that should be transparent to the user. But this cache is getting outdated and only restarting the daemon fixes it.

I could understand it keeping some nodes as "alive" in the cache, since a node has to be unresponsive for some time before it is removed. But I am starting new nodes and they do not show up in any commands that use the daemon, even after waiting several minutes. I have to restart the daemon or use the --no-daemon flag.
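
One way to check whether the daemon's cache has drifted is to diff it against a daemon-less query (a sketch only; the --no-daemon and --spin-time flags are the ones discussed later in this thread):

# Left side: the daemon's cached view. Right side: a fresh discovery given up to 5 seconds.
diff <(ros2 node list | sort) <(ros2 node list --no-daemon --spin-time 5 | sort)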

fujitatomoya commented 3 years ago

Ah, I see. You are saying:

But this cache is getting outdated and only restarting the daemon fixes it.

problem-1: old cache can be seen, and will not be cleaned?

But I am starting new nodes and they do not show up on any commands that use the daemon, even after waiting several minutes.

problem-2: cache does not get updated?

Am I understanding correctly?

v-lopez commented 3 years ago

Exactly, I've seen both issues.

problem-1: Cache (daemon) retaining nodes killed long ago.
problem-2: Cache (daemon) not adding new nodes.

I'm trying to find reproducible examples. Currently I can make it happen 100% of the time, but only on a complex setup involving ros2_control with 2 controllers and launching and stopping navigation2.

There may also be underlying rmw issues causing problem-2, since I've seen that rviz2 would not list the topics from the newly spawned nodes, and even though I haven't looked in depth, I believe rviz2 has no relation to ros2cli.

audrow commented 3 years ago

Probably related to https://github.com/ros2/rmw_fastrtps/issues/509.

fujitatomoya commented 3 years ago

could be related to https://github.com/ros2/rmw_fastrtps/pull/514 if the communication is localhost?

BrettRD commented 3 years ago

I'm seeing this bug on a project with five nodes, FastRTPS, native Ubuntu install.

I'm using ros2 launch files. Everything comes up nicely the first couple of times, but eventually ros2 node list stops seeing all of the nodes (which are definitely running). At the same time, ros2 param stops being able to interact with the hidden nodes, and ros2 topic list stops showing all of the topics.

rqt is a bit weird; there were a few times when it seemed to find a different collection of topics and nodes than the CLI tools.

ros2 daemon stop; ros2 daemon start has saved my day.

fujitatomoya commented 3 years ago

@BrettRD

If your problem is related to https://github.com/ros2/rmw_fastrtps/pull/514, it would be really appreciated if you could try the https://github.com/ros2/ros2/tree/foxy branch to check whether you still hit the problem.

BrettRD commented 3 years ago

@fujitatomoya I'm currently running ros2 from apt, and this is pretty tedious to replicate with any confidence, so I'd like a sanity check on a procedure.

I'll try the following:

1. Rebuild the workspace from scratch (rm -rf install/ build/), using ROS from /opt/ros/foxy/setup.bash.
2. Reset the ros2 daemon.
3. Launch and tear down the application a bunch of times and count how many times it launches before ros2 node list misses nodes.

That sets an order-of-magnitude baseline for how long to test the new branch.

Then install ROS 2 from source:

1. Clear the workspace (rm -rf install/ build/).
2. Load a new terminal without ros2 from apt.
3. Clone the ros2 repos into a folder in src.
4. Rebuild with colcon (including the ros2 source packages).
5. Load the local setup (. install/setup.bash), which should reference the local foxy latest.
6. Reset the ros2 daemon.
7. Repeat the launch and teardown until it drops nodes (confirming it is not fixed) or until I get bored (inconclusive but reassuring).
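
A rough sketch of such a launch-and-teardown loop (the launch file name, iteration count, and 10-second cutoff below are illustrative placeholders, not from the original report):

for i in $(seq 1 20); do
    # Interrupt the launch with SIGINT after 10 s, i.e. possibly before everything is fully up.
    timeout -s INT 10 ros2 launch my_pkg my_five_node_launch.py   # hypothetical launch file
    sleep 5
    echo "iteration $i: $(ros2 node list | wc -l) nodes visible via the daemon"
done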

Does that sound about right?

fujitatomoya commented 3 years ago

I think that sounds okay; the whole procedure is documented at https://docs.ros.org/en/foxy/Installation/Linux-Development-Setup.html. I usually use an ubuntu:20.04 docker container as the base.

BrettRD commented 3 years ago

I have a result! -- Not fixed.

I built from source (55 minutes build time, after tracking down additional deps), and my build does contain ros2/rmw_fastrtps#514. I did not source /opt/ros/foxy/setup.bash, so I'm using foxy latest.

In order to trigger this bug, I have to SIGINT ros2 launch before all the nodes are up, loading and closing fast enough to see duplicate nodes (which age out normally).

Once this bug is triggered, I can load the same 5-node launch file and ros2 node list will list a random subset of the nodes from the launch file, but always the same number, until you run ros2 daemon stop; then everything goes back to normal. Other nodes like rqt and ros2 topic echo are listed fine.

I can retrigger this bug, and the size of the subset gets smaller by one node each time. I can keep triggering it until no nodes from that launch file get listed, and eventually reloading rqt doesn't list them either.

ZhenshengLee commented 1 year ago

Recently I've run into this bug in my project, and here is what I found:

And I have a few questions for @nielsvd @BrettRD @v-lopez:

  1. I'm not sure why the rmw could cause this problem; would changing the rmw solve this issue? @fujitatomoya I've found it happening with rmw_cyclonedds in a compiled version: https://github.com/ZhenshengLee/ros2_jetson/issues/10
  2. All of ros2cli depends on rclpy; might using rclcpp be a workaround to bypass this issue?
  3. Has this issue been resolved in a later release of ROS 2, like Galactic or Humble?

fujitatomoya commented 1 year ago

I'm not sure why rmw could cause this problem, does changing rmw would solve this issue?

The discovery protocol is implemented in the RMW implementation, so changing the rmw would solve the problem.

all ros2cli depends on rclpy, may using rclcpp would be a workaround way to bypass this issue?

No, I do not think so. Related to the previous comment, discovery depends on the underlying rmw implementation.

does this issue being resolved in the future release of ros2, like galactic or humble?

I cannot reproduce this issue with my local environment and the rolling branch.

ZhenshengLee commented 1 year ago

@fujitatomoya thank you for your quick reply.

discovery protocol is implemented in RMW implementation, so changing rmw would solve the problem.

Thanks for your tips, I will have a try.
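
For reference, the RMW is switched through the RMW_IMPLEMENTATION environment variable; a minimal sketch, assuming the rmw_cyclonedds_cpp package is installed, and remembering that the daemon has to be restarted so it runs with the same RMW as the nodes:

export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
# Restart the daemon so it also picks up the new RMW:
ros2 daemon stop
ros2 node list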

no i do not think so, related to previous comment, discovery depends on underneath rmw implementation.

OK, so rclcpp would not bypass the issue.

i cannot reproduce this issue with my local environment and rolling branch.

According to @v-lopez, only a complex launch causes this node-list problem:

I'm trying to find reproducible examples, currently I can make it happen 100% of the time, but on a complex setup involving ros2_control with 2 controllers and launching and stopping navigation2.

BrettRD commented 1 year ago

I have not noticed this bug in Galactic, but I encountered it immediately again when I used Humble. I have seen https://github.com/ZhenshengLee/ros2_jetson/issues/10 in Galactic.

fujitatomoya commented 1 year ago

@iuhilnehc-ynos @llapx can you check if we can see this problem with rolling, if you have bandwidth?

I think there is no easy reproducible procedure currently, but we can check with https://github.com/ros2/ros2cli/issues/582#issue-784108824.

ZhenshengLee commented 1 year ago

I have not noticed this bug in Galactic, but I encountered it immediately again when I used Humble.

@BrettRD the primary difference between Galactic and Humble/Foxy is the default rmw used (Galactic defaults to rmw_cyclonedds_cpp, while Foxy and Humble default to rmw_fastrtps_cpp).

ZhenshengLee commented 1 year ago

problem-1: Cache (daemon) retaining nodes killed long ago. problem-2: Cache (daemon) not adding new nodes.

since I've seen that rviz2 would not list the topics from the newly spawned nodes, and even though I haven't looked in depth, I believe rviz2 has 0 relation with ros2cli.

From my test https://github.com/ros2/ros2cli/issues/779#issuecomment-1315117834 and the comment from @v-lopez above, rviz2 bypasses the node-missing issue.

I believe the root cause is not in the rmw layer, so changing the rmw will not bypass the issue, and rclcpp/rviz2 will not see this problem.

llapx commented 1 year ago

@fujitatomoya

OK, I'll take a look.

llapx commented 1 year ago

I have tested it on ros:rolling (docker), building turtlebot3 and navigation2 from source (ros:rolling does not provide nav2 packages); after testing many times, it works well.

iuhilnehc-ynos commented 1 year ago

This issue is not easy to reproduce.

But it must still be there, because I can reproduce it with rolling a few times (the steps are similar to https://github.com/ros2/ros2cli/issues/582#issue-784108824). After stopping the ros2 daemon, as in step 2 of https://github.com/ros2/ros2cli/issues/582#issue-784108824, we immediately get the correct node list again.

1. ros2 daemon stop (stop ros2 daemon if it ran before)
2. ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False
3. ros2 node list | wc -l (showing 31 is currently the correct result)
4. Ctrl+C to stop step 2, then re-launch it and re-check step 3.

Notice that the navigation demo runs well even if the ros2 node list is incorrect.

[screenshot]

iuhilnehc-ynos commented 1 year ago
  1. I can't use rmw_cyclonedds_cpp to reproduce this issue.

  2. For rmw_fastrtps_cpp: since Ctrl+C on ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False can't make all processes exit normally, the shared-memory files used by Fast-DDS are not cleaned up successfully. I don't know if that is the root cause of the ros2 daemon no longer updating node_listener -> rmw_dds_common::GraphCache::update_participant_entities.

  3. Some information about the ros2 daemon:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3648025 chenlh    20   0  667912  79412  47136 R  99.7   0.2   4:02.62 python3       # almost 100% CPU usage
3648022 chenlh    20   0  667912  79412  47136 S   0.3   0.2   0:03.56 python3
3647989 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.40 python3
3648019 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648020 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648021 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.01 python3
3648023 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.08 python3
3648024 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648026 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648027 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.05 python3
3648028 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648029 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.02 python3

The thread with LWP 3648025 turns out to be thread Id 8:

(gdb) info thread
  Id   Target Id                                     Frame 
* 1    Thread 0x7faf51f801c0 (LWP 3647989) "python3" 0x00007faf52099d7f in __GI___poll (fds=0x7faf513bbae0, nfds=1, timeout=7200000)
    at ../sysdeps/unix/sysv/linux/poll.c:29
  2    Thread 0x7faf4c282640 (LWP 3648019) "python3" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, 
    expected=0, futex_word=0x7faf50ceb000 <(anonymous namespace)::g_signal_handler_sem>) at ./nptl/futex-internal.c:57
  3    Thread 0x7faf4ba81640 (LWP 3648020) "python3" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7faf4ba80de0, op=137, 
    expected=0, futex_word=0x55e32f872ae0) at ./nptl/futex-internal.c:57
  4    Thread 0x7faf4b280640 (LWP 3648021) "python3" __futex_abstimed_wait_common64 (private=290346745, cancel=true, abstime=0x7faf4b27fc10, op=137, 
    expected=0, futex_word=0x55e32feb7760) at ./nptl/futex-internal.c:57
  5    Thread 0x7faf4a9f8640 (LWP 3648022) "python3" __futex_abstimed_wait_common64 (private=1326168272, cancel=true, abstime=0x7faf4a9f7c10, op=137, 
    expected=0, futex_word=0x55e32ff19bcc) at ./nptl/futex-internal.c:57
  6    Thread 0x7faf4a1f7640 (LWP 3648023) "python3" 0x00007faf520a8934 in __libc_recvfrom (fd=17, buf=0x55e32ff1c570, len=65500, flags=0, addr=..., 
    addrlen=0x7faf4a1f6a0c) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
  7    Thread 0x7faf499f6640 (LWP 3648024) "python3" 0x00007faf520a8934 in __libc_recvfrom (fd=18, buf=0x55e32ff2cd90, len=65500, flags=0, addr=..., 
    addrlen=0x7faf499f5a0c) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
  8    Thread 0x7faf491e8640 (LWP 3648025) "python3" 0x00007faf500de664 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
  9    Thread 0x7faf489e7640 (LWP 3648026) "python3" 0x00007faf520a8934 in __libc_recvfrom (fd=20, buf=0x55e32ff40070, len=65500, flags=0, addr=..., 
    addrlen=0x7faf489e6a0c) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
  10   Thread 0x7faf481d9640 (LWP 3648027) "python3" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7faf481d8940, 
    op=265, expected=0, futex_word=0x7faf470c9110) at ./nptl/futex-internal.c:57
  11   Thread 0x7faf478f8640 (LWP 3648028) "python3" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x55e32ff54a28) at ./nptl/futex-internal.c:57
  12   Thread 0x7faf46d57640 (LWP 3648029) "python3" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7faf30000c04) at ./nptl/futex-internal.c:57

The backtrace for thread Id 8:

(gdb) thread 8
[Switching to thread 8 (Thread 0x7faf491e8640 (LWP 3648025))]
#0  0x00007faf500df636 in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) bt
#0  0x00007faf500df636 in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#1  0x00007faf4f6b4163 in eprosima::fastdds::rtps::SharedMemManager::find_segment (this=0x55e32fd29aa0, id=...)
    at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemManager.hpp:1282
#2  0x00007faf4f6b22f1 in eprosima::fastdds::rtps::SharedMemManager::Listener::pop (this=0x55e32ff2ccf0)
    at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemManager.hpp:711
#3  0x00007faf4f6b58fb in eprosima::fastdds::rtps::SharedMemChannelResource::Receive (this=0x55e32fe3b100, remote_locator=...)
    at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemChannelResource.hpp:182
#4  0x00007faf4f6b556e in eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation (this=0x55e32fe3b100, input_locator=...)
    at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemChannelResource.hpp:133
#5  0x00007faf4f6d0579 in std::__invoke_impl<void, void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> (
    __f=@0x55e32ff3fa78: (void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastdds::rtps::SharedMemChannelResource * const, eprosima::fastrtps::rtps::Locator_t)) 0x7faf4f6b54e4 <eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation(eprosima::fastrtps::rtps::Locator_t)>, __t=@0x55e32ff3fa70: 0x55e32fe3b100) at /usr/include/c++/11/bits/invoke.h:74
#6  0x00007faf4f6d00e2 in std::__invoke<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> (
    __fn=@0x55e32ff3fa78: (void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastdds::rtps::SharedMemChannelResource * const, eprosima::fastrtps::rtps::Locator_t)) 0x7faf4f6b54e4 <eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation(eprosima::fastrtps::rtps::Locator_t)>) at /usr/include/c++/11/bits/invoke.h:96
#7  0x00007faf4f6cfeb3 in std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> >::_M_invoke<0ul, 1ul, 2ul> (this=0x55e32ff3fa58)
    at /usr/include/c++/11/bits/std_thread.h:253
#8  0x00007faf4f6cf952 in std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> >::operator() (this=0x55e32ff3fa58)
    at /usr/include/c++/11/bits/std_thread.h:260
#9  0x00007faf4f6cf218 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> > >::_M_run (this=0x55e32ff3fa50)
    at /usr/include/c++/11/bits/std_thread.h:211
#10 0x00007faf501c42b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007faf52015b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007faf520a7a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

https://github.com/eProsima/Fast-DDS/blob/7e12e8fe2cebf27c621263fa544f94b099504808/src/cpp/rtps/transport/shared_mem/SharedMemChannelResource.hpp#L128-L136

    void perform_listen_operation(
            Locator input_locator)
    {
        Locator remote_locator;

        while (alive())
        {
            // Blocking receive.
            std::shared_ptr<SharedMemManager::Buffer> message;

            if (!(message = Receive(remote_locator)))
                            // ^ `Receive` is expected to block when there is no data, but here it keeps returning a nullptr message again and again.
            {
                continue;
            }

Receive fails to pop the message because find_segment throws an exception inside.

I don't know whether it's a bug or not, because I can't reproduce this issue the first time after clearing the related shm files (/dev/shm/*fastrtps*).
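
For anyone who wants to try the same cleanup, a minimal sketch (the segment names are Fast-DDS specific; removing them is only safe once nothing, including the ros2 daemon, still maps them, and fastdds shm clean, mentioned below, is the supported tool):

# Inspect leftover Fast-DDS shared-memory segments after an unclean shutdown:
ls -l /dev/shm/*fastrtps*
# Stop the daemon first, then clear the segments:
ros2 daemon stop
rm -f /dev/shm/*fastrtps*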

fujitatomoya commented 1 year ago

could be related to https://github.com/eProsima/Fast-DDS/issues/2790

fujitatomoya commented 1 year ago

@iuhilnehc-ynos a couple of questions.

can't make all processes exit normally

Can you point out which node or processes cannot exit normally? Is it receiving an exception, or a core crash?

I can't reproduce this issue the first time after clearing the related shm files /dev/shm/fastrtps.

I think this is a good step that we found out.

iuhilnehc-ynos commented 1 year ago

can you point out which node or processes cannot exit normally? is that receiving exception or core crash?

Pressing Ctrl+C on ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False behaves differently each time, but most errors come from rviz2 and component_container_isolated, which might be killed by ros2 launch.

Is it always the same node that cannot be listed, or a random node?

It shows a random node list, but once the issue happens, the node list stays almost the same as the previous one while running tb3_simulation_launch.py again; only some node names with new IDs are refreshed, such as the launch node /launch_ros_{a_new_pid}.

  • if we add the fastdds shm clean step to this procedure, does the problem stop happening?

No. I tried using fastdds shm clean, but it is not enough, because shared-memory files for data communication are still in use by the ros2 daemon's node. I must stop the ros2 daemon.

BTW: I think it's not difficult to reproduce this issue. Don't be gentle with tb3_simulation_launch.py (press Ctrl+C at any time to stop it, then rerun it immediately). I have confirmed this issue with both Humble and Rolling.

iuhilnehc-ynos commented 1 year ago

I hope you guys can reproduce this issue on your machines; otherwise, nobody can help confirm it even if I have a workaround patch :smile:.

fujitatomoya commented 1 year ago

@JLBuenoLopez-eProsima @MiguelCompany any thoughts? I believe it is clear that the shared-memory files or caches used by the ros2 daemon are related to the issue.

billyliuschill commented 1 year ago

I had an issue calling ros2 node list from another terminal using a Python script. On occasion there would be missing nodes on the first call, but subsequent calls would populate the node list correctly.

I tried other methods such as stopping and restarting the daemon, and that seemed to work, but I felt apprehensive about that workaround as I don't fully understand the consequences. What I found worked was adding the --spin-time parameter to the call: ros2 node list --spin-time 5. That always seemed to populate the node list correctly. I hope this helps others.

What does --spin-time do?

--spin-time SPIN_TIME Spin time in seconds to wait for discovery (only applies when not using an already running daemon)

fujitatomoya commented 1 year ago

I tried other methods such as stopping and restarting the daemon and that seemed to work, but I felt apprehensive of that workaround as I don't fully understand the consequences.

The downside could be discovery time for anything else running on that host system. The daemon caches and advertises the ROS 2 network graph, so while the daemon is running, other ROS 2 nodes on the same host can get connectivity information from the daemon without waiting for the entire discovery.

What does --spin-time do?

We can use this option to wait for the ROS 2 network graph to be updated until the specified timeout expires, but the option is only valid when the daemon is not running or the --no-daemon option is specified.
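
So, to make --spin-time take effect it has to be combined with --no-daemon (or used while the daemon is stopped), for example:

# Bypass the daemon and give discovery up to 5 seconds:
ros2 node list --no-daemon --spin-time 5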

JMaravalhasSilva commented 1 year ago

What I found what worked was adding --spin-time parameter in the call: ros2 node list --spin-time 5 That always seemed to populate the node list correctly. I hope this helps others.

Currently having this problem as well, but --spin-time does not work for me. The only workaround that works is using the --no-daemon option. Other commands such as ros2 param list also do not work. I'm running only a single node on humble, Ubuntu 22.04 (LTS).

Restarting the daemon also does not seem to solve the problem.

No idea if it helps, but here is the output of ros2 doctor --report while my node is running:

/opt/ros/humble/lib/python3.10/site-packages/ros2doctor/api/__init__.py: 154: UserWarning: Fail to call PackageReport class functions.

   NETWORK CONFIGURATION
inet         : 127.0.0.1
inet4        : ['127.0.0.1']
inet6        : ['::1']
netmask      : 255.0.0.0
device       : lo
flags        : 73<RUNNING,UP,LOOPBACK>
mtu          : 65536
inet         : 192.168.220.61
inet4        : ['192.168.220.61']
ether        : 3c:a9:f4:17:ec:08
inet6        : ['fe80::e214:a874:3128:3e04%wlo1']
netmask      : 255.255.0.0
device       : wlo1
flags        : 4163<BROADCAST,UP,MULTICAST,RUNNING>
mtu          : 1500
broadcast    : 192.168.255.255
ether        : 2c:59:e5:03:b0:46
device       : enp0s25
flags        : 4099<BROADCAST,UP,MULTICAST>
mtu          : 1500

   PLATFORM INFORMATION
system           : Linux
platform info    : Linux-5.19.0-35-generic-x86_64-with-glibc2.35
release          : 5.19.0-35-generic
processor        : x86_64

   QOS COMPATIBILITY LIST
topic [type]            : /parameter_events [rcl_interfaces/msg/ParameterEvent]
publisher node          : _ros2cli_daemon_42_3d320951c78f477dbb7ee7a28c576fda
subscriber node         : _NODE_NAME_UNKNOWN_
compatibility status    : OK
topic [type]            : /parameter_events [rcl_interfaces/msg/ParameterEvent]
publisher node          : _NODE_NAME_UNKNOWN_
subscriber node         : _NODE_NAME_UNKNOWN_
compatibility status    : OK

   RMW MIDDLEWARE
middleware name    : rmw_fastrtps_cpp

   ROS 2 INFORMATION
distribution name      : humble
distribution type      : ros2
distribution status    : active
release platforms      : {'debian': ['bullseye'], 'rhel': ['8'], 'ubuntu': ['jammy']}

   TOPIC LIST
topic               : none
publisher count     : 0
subscriber count    : 0

Again, not sure if helpful, but when I installed ROS2, I added the following lines to ~/.bashrc:

# ROS 2 configs
source /opt/ros/humble/setup.bash
export ROS_DOMAIN_ID=42
export ROS_LOCALHOST_ONLY=1
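
One thing worth ruling out in a setup like this (an assumption, not a confirmed cause here): the daemon only serves the domain it was started in (note the _ros2cli_daemon_42_... name in the report above), so every terminal that talks to it should use the same ROS_DOMAIN_ID, ROS_LOCALHOST_ONLY, and RMW settings as the shell that first started it:

printenv | grep -E 'ROS_DOMAIN_ID|ROS_LOCALHOST_ONLY|RMW_IMPLEMENTATION'
ros2 daemon status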
mcres commented 1 year ago

In case this helps, I can also reproduce the issue as follows. Note that I put turtlesim as an example, but I obtain the same results with e.g. custom launch files.

  1. With ros2 run turtlesim turtlesim_node running, I get the nodes and topics right:
    $ ros2 node list -a
    /_ros2cli_daemon_0_41534b5a8f1d43cfbb3b1ee12d408355
    /turtlesim
    $ ros2 topic list
    /parameter_events
    /rosout
    /turtle1/cmd_vel
    /turtle1/color_sensor
    /turtle1/pose
  2. Kill the turtlesim_node
  3. Login as root user (su command) and
    root# source /opt/ros/humble/setup.bash
    root# ros2 run turtlesim turtlesim_node

    Turtlesim is correctly launched (note that to reproduce the issue it's not enough to source ROS 2; the turtlesim_node must be launched as well). However, at this point I can only see this information:

    root# ros2 node list -a
    /_ros2cli_daemon_0_41534b5a8f1d43cfbb3b1ee12d408355
    root# ros2 topic list
    /parameter_events
    /rosout
  4. After exiting su and launching turtlesim as a regular user, the problem persists:
    $ ros2 node list -a
    /_ros2cli_daemon_0_41534b5a8f1d43cfbb3b1ee12d408355
    $ ros2 topic list
    /parameter_events
    /rosout
  5. After ros2 daemon stop && ros2 daemon start and launching turtlesim, I can see the right information again:
    $ ros2 node list -a
    /_ros2cli_daemon_0_b175e66117984230bf91ab71681160d6
    /turtlesim
    $ ros2 topic list
    /parameter_events
    /rosout
    /turtle1/cmd_vel
    /turtle1/color_sensor
    /turtle1/pose
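
Given that the broken state appears right after running a node as root, one thing to inspect (speculative, not verified in this thread) is whether Fast-DDS shared-memory segments in /dev/shm were left behind owned by root, which the regular user's daemon could then fail to open:

ls -l /dev/shm | grep -i fast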

Here's my ros2 doctor --report:

PLATFORM INFORMATION
system           : Linux
platform info    : Linux-5.19.0-46-generic-x86_64-with-glibc2.35
release          : 5.19.0-46-generic
processor        : x86_64

   QOS COMPATIBILITY LIST
compatibility status    : No publisher/subscriber pairs found

   RMW MIDDLEWARE
middleware name    : rmw_fastrtps_cpp

   ROS 2 INFORMATION
distribution name      : humble
distribution type      : ros2
distribution status    : active
release platforms      : {'debian': ['bullseye'], 'rhel': ['8'], 'ubuntu': ['jammy']}

Here my environment variables (same for both regular and root users):

$ printenv | grep ROS
ROS_VERSION=2
ROS_PYTHON_VERSION=3
ROS_LOCALHOST_ONLY=0
ROS_DISTRO=humble

Edit: A couple things I'd like to add for clarification:

fujitatomoya commented 1 year ago

@iuhilnehc-ynos

Can you evaluate the 2 PRs introduced in https://github.com/ros2/rmw_fastrtps/issues/699#issuecomment-1653795722 with the reproducible procedure in this issue?

iuhilnehc-ynos commented 1 year ago

@fujitatomoya

After testing many times with ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False, ros2 node list | wc -l consistently returns the expected count (31) in the end.

I believe this issue is fixed by https://github.com/eProsima/Fast-DDS/pull/3753.

fujitatomoya commented 1 year ago

@iuhilnehc-ynos great news! Thanks for checking.

fujitatomoya commented 1 year ago

@iuhilnehc-ynos thanks for testing, I will go ahead and close this.