ros2 / rclpy

rclpy (ROS Client Library for Python)

High-Latency networks: "generator already executing" when calling a service #1351

Open mmatthebi opened 2 months ago

mmatthebi commented 2 months ago

Bug report

We have observed strange behaviour when there is high network latency between an action client and server. Such a setup can occur, for example, when two robots are connected via an LTE link. I could replicate the error with two docker containers and appropriate traffic shaping.


Required Info:

Steps to reproduce issue

See the attached zip file bugreport.zip, which contains a docker-compose script along with the source code for an action server and an action client. The architecture is as follows:
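In short (reconstructed from the steps and the discussion below): client1 runs action_server.py, while client2 runs action_client.py, which hosts the /mode service and a /data subscription and sends one action goal per received /data message. Below is a hypothetical, heavily simplified sketch of that client-side structure; the Fibonacci action type, the node names, and all other details are assumptions, and the code in bugreport.zip is authoritative.

import rclpy
from rclpy.action import ActionClient
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node

# Assumed stand-ins; the real interfaces are in the attached zip.
from example_interfaces.action import Fibonacci
from std_msgs.msg import String
from std_srvs.srv import SetBool


class GoalSender(Node):
    """Inner node, created by the outer node below."""

    def __init__(self):
        super().__init__('goal_sender')
        # Assumes the action server (step 4 below) is already running.
        self._client = ActionClient(self, Fibonacci, 'fibonacci')

    def send_goal(self):
        goal = Fibonacci.Goal()
        goal.order = 5
        future = self._client.send_goal_async(goal)
        # No executor argument: this waits on the shared *global* executor,
        # even though it is called from a callback that is already running
        # inside the MultiThreadedExecutor created in main().
        rclpy.spin_until_future_complete(self, future)


class Supervisor(Node):
    """Outer node: /mode toggles whether each /data message triggers a goal."""

    def __init__(self):
        super().__init__('supervisor')
        self._goal_sender = GoalSender()   # one node creating the other
        self._enabled = False
        self.create_service(SetBool, 'mode', self._on_mode)
        self.create_subscription(String, 'data', self._on_data, 10)

    def _on_mode(self, request, response):
        self._enabled = request.data
        response.success = True
        return response

    def _on_data(self, msg):
        if self._enabled:
            self._goal_sender.send_goal()


def main():
    rclpy.init()
    supervisor = Supervisor()
    executor = MultiThreadedExecutor()
    executor.add_node(supervisor)
    try:
        executor.spin()
    finally:
        rclpy.shutdown()


if __name__ == '__main__':
    main()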

Expected behavior

The behaviour is as expected when there is a low-latency connection between the containers:

  1. start up the docker containers with docker-compose up -d
  2. log in to both docker containers (in different terminals):
    docker exec -it client1 /bin/bash
    docker exec -it client2 /bin/bash
  3. in both containers, start tmux, move to /app, and source the ROS environment with . /opt/ros/humble/setup.bash
  4. in client1: python3 action_server.py
  5. in client2, open multiple tmux panes (pane1, pane2, pane3)
  6. pane1: python3 action_client.py
  7. pane2: ros2 service call /mode std_srvs/srv/SetBool '{data: true}'
  8. pane3: ros2 topic pub /data std_msgs/msg/String -r 10

From this point on, the action_server is constantly performing actions and the client is reporting on the console. Now, stop the action calls by calling the service:

ros2 service call /mode std_srvs/srv/SetBool '{data: false}'

The client stops executing the actions as expected.

Actual behavior

Now we add network latency between the containers by calling

./delay.sh  # equivalent to tc qdisc add dev eth0 root netem delay 200ms

With the delay in place, the service call to stop the actions blocks forever and the "generator already executing" crash from the issue title occurs.

Additional information

The blocking of the service call depends on the rate of the /data publisher. Essentially, it seems that if the service call to stop the actions arrives while a packet transmission is ongoing, the call blocks forever and the crash occurs.

I know that the architecture of the action_client.py is a bit strange, as there are two nodes in the process where one is creating the other. However, we have seen such an architecture in the wild where this bug occurred. It would be of great help to understand what the cause is and how to fix it.

jmblixt3 commented 2 months ago

This certainly sounds like #1123, so potentially try to apply pr #1308 and see if it fixes your issue.

fujitatomoya commented 2 months ago

> This certainly sounds like https://github.com/ros2/rclpy/issues/1123, so potentially try to apply pr https://github.com/ros2/rclpy/pull/1308 and see if it fixes your issue.

@jmblixt3

I do not really see the logical reason why https://github.com/ros2/rclpy/pull/1308 would fix this... can you explain your thinking a bit more?

IMO, this happens because the same SingleThreadedExecutor (the global executor object) is driven in a multi-threaded manner by the MultiThreadedExecutor: _spin_once_impl gets called concurrently, and finally the _wait_for_ready_callbacks generator raises the ValueError ("generator already executing").
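For reference, that ValueError comes from Python's generator protocol itself: stepping one generator object from two threads at the same time raises exactly this error. A minimal, rclpy-free sketch (the dummy generator below merely stands in for _wait_for_ready_callbacks; this is an illustration, not rclpy code):

import threading
import time


def wait_for_work():
    # Stands in for _wait_for_ready_callbacks: blocks inside the generator,
    # then yields once something is "ready".
    while True:
        time.sleep(0.5)
        yield


gen = wait_for_work()


def spin_once():
    try:
        next(gen)   # both threads step the same generator object
    except ValueError as exc:
        print(f'caught: {exc}')   # -> caught: generator already executing


threads = [threading.Thread(target=spin_once) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()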

The reason we only see this problem with the extra 200 ms delay is that the 1st service call is still waiting to complete because of the delay when the 2nd call comes in. (Without the delay, the service call completes much more quickly, so the 1st spin_until_future_complete on the global executor finishes before the 2nd request comes in.)

@mmatthebi thanks for the detailed information.

can you try with the following patch for your application to see if that works?

from rclpy.executors import SingleThreadedExecutor, MultiThreadedExecutor
...
        # Wait on a dedicated executor instead of the shared global one.
        se = SingleThreadedExecutor()  # HERE
        rclpy.spin_until_future_complete(self, send_goal_future, executor=se)  # HERE
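Applied to the hypothetical sketch from the issue description above, the workaround would look roughly like this; where exactly the call sits in the real action_client.py is an assumption:

    def send_goal(self):
        goal = Fibonacci.Goal()
        goal.order = 5
        future = self._client.send_goal_async(goal)
        # A dedicated executor for this wait: nothing else steps it
        # concurrently, unlike the shared global executor.
        se = SingleThreadedExecutor()
        rclpy.spin_until_future_complete(self, future, executor=se)
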
mmatthebi commented 2 months ago

Thank you all for your help! The hint from #1123 unfortunately did not fix it. Though, if running with rolling, the error does not occur, so it seems to be fixed somewhere internally already.

However, huge huge thanks to @fujitatomoya for providing a workaround! I have implemented it in the testcase and the error goes away. I will try this in production and see if it helps there as well. I'll let you know. Thanks again!

fujitatomoya commented 2 months ago

@mmatthebi thanks for checking that.

> Though, if running with rolling, the error does not occur, so it seems to be fixed somewhere internally already.

One question: did you build the source code or just use the released package? Can you share the version/commit hash of either the package or the source code?

mmatthebi commented 2 months ago

I ran the tests again. With rolling, the service call still blocks sometimes, but a subsequent call does not crash with the "generator already executing" bug. However, the action calls to the action server do not get through smoothly but rather in bursts. With your proposed solution using the dedicated executor, on the other hand, the service calls do not block and the action calls also go through more smoothly.

I used the packaged rolling version from the ros2:rolling docker container:

root@client2:/app# apt-cache show ros-rolling-desktop
Package: ros-rolling-desktop
Version: 0.11.0-1noble.20240820.020842
Architecture: amd64
Maintainer: Geoffrey Biggs <geoff@openrobotics.org>