ros2 / rmw_zenoh

RMW for ROS 2 using Zenoh as the middleware
Apache License 2.0

Service Call Hangup #143

Open bagelbytes61 opened 3 months ago

bagelbytes61 commented 3 months ago

I'd like to preface this by saying I am not yet sure whether this is an rmw_zenoh issue or an eclipse-zenoh/zenoh-c issue.

Some basic system information:

- Platform: Ubuntu 24.04
- rustc: 1.75.0
- GCC: 9.5.0
- ROS 2 packages: latest Rolling as of 03/27/2024, built from source (will try to reproduce with the latest today)

The issue: when making service calls to a ROS 2 node, the service call hangs and never returns (or occasionally even segfaults, though I could only reproduce that with custom debug builds). This has only been observed under some specific circumstances. First, the server node must be logging via the RCLCPP_* macros (we use RCLCPP_INFO, for example). Second, there should be a publisher node that the server is subscribed to. We are publishing ~245 KiB at a rate of 50 Hz, so about 12 MiB/s. The second condition isn't strictly necessary, but it seems to help trigger the issue; in my debug builds I didn't need to run the publisher node.

Through my own debugging, it appears that an indefinite wait occurs here (https://github.com/eclipse-zenoh/zenoh/blob/05b9cb459f77693bbf7c89d67265ba1519959814/zenoh/src/net/routing/dispatcher/pubsub.rs#L435), blocked on the writer releasing its hold on the RwLock. I have been unable to determine where that write lock is being acquired, if that is in fact what is going on. Since the lock is acquired via an RwLockWriteGuard, ownership should be relinquished as soon as that guard is dropped/goes out of scope, leaving the reader free to acquire the lock. I have also inspected the call stacks of the other threads and did not observe any lingering RwLockWriteGuards still on the stack.

This problem occurs in both normal operation and shared_memory mode.

An interesting note too is that directly publishing to /rosout is not sufficient to trigger the bug -- logging MUST be via the RCLCPP_* interface.

I will update this issue as I make more debugging progress.

To reproduce:

1. Build and run the Docker container located here: https://github.com/jackg0/ros2_simple_tests
2. Launch the zenoh router via ros2 run rmw_zenoh_cpp rmw_zenohd
3. Launch both the simple server and the simple publisher
4. Trigger the issue by making service calls to the server with ros2 service call /simple_server/trigger std_srvs/srv/Trigger

I hope this information provides some insight; as noted above, I will be actively debugging this myself and will post updates as they come.

Yadunund commented 3 months ago

As a sanity check, does the system work as expected with rmw_fastrtps_cpp and rmw_cyclonedds_cpp?

JEnoch commented 3 months ago

I ran the test files in an Iron setup (not in Docker) with rmw_cyclonedds_cpp and rmw_zenoh_cpp (both the rolling and https://github.com/ros2/rmw_zenoh/pull/146 branches). The ros2 service call /simple_server/trigger std_srvs/srv/Trigger was always successful.

Could it be a temporary issue with rclcpp in Rolling, or with your Dockerized environment?

bagelbytes61 commented 1 month ago

Apologies for not getting back to you for so long.

I've done some more testing and have confirmed on both Rolling and Jazzy that the issue is still present and can be triggered by the same steps described in the issue. I will say that it may take a fair number of service calls to trigger the hang, so I would recommend doing something like:

while true; do
    ros2 service call /simple_server/trigger std_srvs/srv/Trigger
done

Also, it does not appear to occur with other RMWs.

I don't have any other updates on this issue, but just wanted to get back to you guys on this.

Yadunund commented 1 month ago

@bagelbytes61 thanks for following up. Could I confirm that you tested after https://github.com/ros2/rmw_zenoh/pull/134 was merged? (You'll need to do a clean build and also ensure you kill the daemon with ros2 daemon stop before that)

Yadunund commented 1 month ago

@bagelbytes61 I'm unable to reproduce the issue with the latest rolling branch in rmw_zenoh.

I've been continuously making the service call for over 15 minutes now and get a successful response each time. See the video below for a snippet from this test (the warning can be ignored; it is a known issue caused by something else).

https://github.com/ros2/rmw_zenoh/assets/13482049/b50593e5-eba2-4e81-a26c-ad0fa5d04802

bagelbytes61 commented 1 month ago

Hmm. Interesting. Could I ask which version of rustc you are using to compile zenoh-c? I have been using 1.75.0, though I just tried 1.78.0 and experienced the same behavior. Also, could I ask about your environment? I am using a fresh Ubuntu 24.04 virtual machine onto which I installed ros-rolling-desktop via apt, and I built the rmw_zenoh_cpp package with GCC 13.2.0 and rustc 1.75/1.78.

Yadunund commented 1 month ago

I'm compiling with rustc 1.75.0. Clean Ubuntu 24.04, but I have ROS 2 Rolling built from source. I don't see the problem with Iron binaries on 22.04 either.

I do have everything built in Release mode.

When you start continuously making service calls, do you see the problem immediately or after several minutes? I could try running my test for longer...

Yadunund commented 1 month ago

Okay I was able to reproduce it after running for a longer time.

Yadunund commented 1 month ago

@bagelbytes61 I opened PR https://github.com/jackg0/ros2_simple_tests/pull/3 to add a script that creates a node which makes service calls indefinitely. I've had this script running for over 2 hours now without any hang-ups. There is one big difference between this script and running ros2 service call ... within a bash while loop. In the former, the ROS context, and hence the zenoh session, is created once, and then service calls are made indefinitely. In the latter, the context and session are created each time the ros2 service call command is run and, more importantly, destroyed each time. There are some known issues related to shutting down the zenoh session, e.g. https://github.com/ros2/rmw_zenoh/issues/170, which could be the underlying cause.