High-Latency networks: "generator already executing" when calling a service

mmatthebi commented 2 months ago

Bug report

We have observed a strange behaviour when high network latency is occuring between an action client and server. Such setup can e.g. happen when two robots are connected via an LTE link. I could replicate the error with two docker containers and according traffic shaping.

Required Info:

Operating System: Ubuntu 22.04
Installation type: Binaries
Version or commit hash: ros-humble-desktop 0.10.0-1jammy.20240617.124725, but has been reproduced with Iron as well.
DDS implementation: Humble default
Client library (if applicable): probably rclpy

Steps to reproduce issue

See attached zip file bugreport.zip which contains a docker-compose script along with the source code for an action server and an action client. The architecture is as follows:

action_server.py provides an basic action
action_client.py provides
- an action client to above action server
- an service server /mode to enable/disable calling the action in the server
- a subscription /data which triggers the action call upon each reception on this topic, when the action calls have been enabled by the service.
a script to enable network latency between two docker containers.

Expected behavior

The behaviour is as expected when there is a low-latency connection between the containers:

start up the docker containers with docker-compose up -d

login into both docker containers by (in different terminals)

docker exec -it client1 /bin/bash
docker exec -it client2 /bin/bash

in both containers start tmux, move to /app and call . /opt/ros/humble/setup.bash
in client1: python3 action_server.py
in client2, open multiple tmux panes (pane1, pane2, pane3)
pane1: python3 action_client.py
pane2: ros2 service call /mode std_srvs/srv/SetBool '{data: true}'
pane3: ros2 topic pub /data std_msgs/msg/String -r 10

From this point on, the action_server is constantly performing actions and the client is reporting on the console. Now, stop the action calls by calling the service:

ros2 service call /mode std_srvs/srv/SetBool '{data: false}'

The client stops executing the actions as expected.

Actual behavior

Now, we add a network latency between the containers by calling

./delay.sh  # equivalent to tc qdisc add dev eth0 root netem delay 200ms

Check if there is a delay of total 400ms between the containers by pinging the containers each other.
Now, perform steps 4.-8. from previous section. As before, the actions calls are performed by the action server.
Then, trying to stop the action calls by calling the service results in a blocking service call:
```
ros2 service call /mode std_srvs/srv/SetBool '{data: false}'
```

Upon Ctrl-C-ing this service call and calling it again, the client crashes with this stack trace:

  main()
File "/app/action_client.py", line 89, in main
rclpy.spin(action_client, e)
File "/opt/ros/humble/local/l
ib/python3.10/dist-packages/rclpy/__init__.py", line 222, in spin
executor.spin_once()
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 794, in spin_once
self._spin_once_impl(timeout_sec)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 791, in _spin_once_impl
future.result()
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 94, in result
raise self.exception()
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 239, in __call__
self._handler.send(None)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 437, in handler
await call_coroutine(entity, arg)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 391, in _execute_service
response = await await_or_execute(srv.callback, request, srv.srv_type.Response())
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 107, in await_or_execute
return callback(*args)
File "/app/action_client.py", line 75, in _onMode
self._acClient.cancel_goal()
File "/app/action_client.py", line 42, in cancel_goal
rclpy.spin_until_future_complete(self, future)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/__init__.py", line 248, in spin_until_future_complete
executor.spin_until_future_complete(future, timeout_sec)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 303, in spin_until_future_complete
self.spin_once_until_future_complete(future, timeout_sec)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 746, in spin_once_until_future_complete
self._spin_once_impl(timeout_sec)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 728, in _spin_once_impl
handler, entity, node = self.wait_for_ready_callbacks(timeout_sec=timeout_sec)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 711, in wait_for_ready_callbacks
return next(self._cb_iter)
ValueError: generator already executing

Additional information

The blocking of the service call depends on the rate of the /data publisher. Essentially, it seems if the service call to stop the actions is incoming while a packet transmission is ongoing, the call blocks forever and the crash occurs.

I know that the architecture of the action_client.py is a bit strange, as there are two nodes in the process where one is creating the other. However, we have seen such architecture in the wild where this bug occured. It would be of great help to understand what the cause is and how to fix it.

jmblixt3 commented 2 months ago

This certainly sounds like #1123, so potentially try to apply pr #1308 and see if it fixes your issue.

fujitatomoya commented 2 months ago

This certainly sounds like https://github.com/ros2/rclpy/issues/1123, so potentially try to apply pr https://github.com/ros2/rclpy/pull/1308 and see if it fixes your issue.

@jmblixt3

I do not really find the logical reason why https://github.com/ros2/rclpy/pull/1308 can fix this... can you explain your thought a bit more?

IMO, this is because the same SingleThreadedExecutor (global executor object) object is called in multi-threaded manner by MultiThreadedExecutor, and it will call _spin_once_impl concurrently, finally ValueError raised by Generator which is _wait_for_ready_callbacks.

The reason why we have this problem only with extra delay 200ms, is that 1st service call is still waiting to be completed because of this delay, and then 2nd call comes in after that. (without this delay, service call will be completed much quicker so 1st spin_until_future_complete of global executor is completed before 2nd request comes in.)

@mmatthebi thanks for the detailed information.

can you try with the following patch for your application to see if that works?

from rclpy.executors import SingleThreadedExecutor, MultiThreadedExecutor
...
        se = SingleThreadedExecutor() # HERE
        rclpy.spin_until_future_complete(self, send_goal_future, executor=se) # HERE

mmatthebi commented 2 months ago

thank you all for your help! The hint from #1123 did not fix it unfortunately. Though, if running with rolling, the error does not occur, so it seems to be fixed somewhere internally already.

However, huge huge thanks to @fujitatomoya for providing a workaround! I have implemented it in the testcase and the error goes away. I will try this in production and see if it helps there as well. I'll let you know. Thanks again!

fujitatomoya commented 2 months ago

@mmatthebi thanks for checking that.

Though, if running with rolling, the error does not occur, so it seems to be fixed somewhere internally already.

one question, did you build the source code or just use the released package? can you share the version/commit hash of either package or source code?

mmatthebi commented 2 months ago

I ran the tests again. With rolling, the service call still blocks sometimes, but a subsequent call does not crash with the "generator already executing" bug. Though, it seems that the action calls to the action server do not get through smoothly but in bursts. Instead, when using your proposed solution with the dedicated executor, the service calls do not block and also the action calls go through more smoothly.

I used the packaged rolling version from the ros2:rolling docker container:

root@client2:/app# apt-cache show ros-rolling-desktop
Package: ros-rolling-desktop
Version: 0.11.0-1noble.20240820.020842
Architecture: amd64
Maintainer: Geoffrey Biggs <geoff@openrobotics.org>

ros2 / rclpy