ksuszka opened 1 year ago
This sounds like an issue with Iceoryx more than an issue with Cyclone itself: we just use the Iceoryx API to publish data and subscribe to it. It would be good to see whether the Iceoryx guys agree with that initial assessment.
I am not sure whom to ping for help, perhaps @elBoberido?
@ksuszka what do you mean by "aborting the execution abruptly"? Is there still a graceful shutdown, or is the application killed?
@eboasson this error happens when the chunks taken from the subscriber are not released. Every time a chunk is taken out of the queue it is internally stored in a fixed-size used-chunk list. When the list is full and there is no space for further tracking, the subscriber immediately releases the chunk and returns the TOO_MANY_CHUNKS_HELD_IN_PARALLEL error. It might be a problem in the rmw implementation, but I'm not quite sure whom to ask.
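For anyone unfamiliar with that mechanism, here is a minimal sketch at the plain iceoryx API level (not the CycloneDDS/rmw code path; the service description, the loop, and the runtime name are made up for illustration):

```cpp
// Minimal sketch of the iceoryx take/release cycle (untyped API).
#include "iceoryx_posh/popo/untyped_subscriber.hpp"
#include "iceoryx_posh/runtime/posh_runtime.hpp"

int main()
{
    iox::runtime::PoshRuntime::initRuntime("chunk-hold-demo");
    iox::popo::UntypedSubscriber subscriber({"Example", "Demo", "Data"});

    while (true)
    {
        subscriber.take()
            .and_then([&](const void* userPayload) {
                // ... process the sample ...
                // Without this call the chunk stays in the subscriber's
                // fixed-size used-chunk list:
                subscriber.release(userPayload);
            })
            .or_else([](auto& result) {
                if (result == iox::popo::ChunkReceiveResult::TOO_MANY_CHUNKS_HELD_IN_PARALLEL)
                {
                    // The used-chunk list is full: iceoryx releases the chunk
                    // immediately and reports this error instead of leaking it.
                }
            });
    }
}
```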
@elBoberido it throws an exception: https://github.com/ksuszka/cyclonedds_iceoryx_memory_leak/blob/d92c07b7c3400bafcfffbc9dd59f62afc638693a/test_publisher/main.cpp#L16
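For readers without the repo at hand, a rough illustration of what "aborting abruptly" means here, i.e. terminating via an uncaught exception instead of a clean shutdown. This is not the linked code; the node name, topic, message type, and loop are invented:

```cpp
#include <stdexcept>
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("test_publisher");
  auto pub = node->create_publisher<std_msgs::msg::String>("test_topic", 10);

  std_msgs::msg::String msg;
  msg.data = "payload";
  for (int i = 0; i < 1000; ++i) {
    pub->publish(msg);
  }

  // No rclcpp::shutdown(): the uncaught exception terminates the process
  // without any graceful teardown of the middleware.
  throw std::runtime_error("publisher aborts abruptly");
}
```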
Writing here just to link the issues ;)
We are also experiencing memory leaks that may or may not be related to CycloneDDS; more information here: https://github.com/ros2/geometry2/pull/630#issuecomment-1808003236
FYI, for us (with @nachovizzo) the root cause was actually with tf2: https://github.com/ros2/geometry2/pull/636
Bug report
Required Info:
Steps to reproduce issue
I prepared a separate repo with two very simple applications (a publisher and a subscriber) to show the issue: https://github.com/ksuszka/cyclonedds_iceoryx_memory_leak/tree/chunks-leak
Using this repo, build the docker image.
Open four terminal windows. In the first terminal window run the first command of the sequence sketched below.
In the second terminal window run the next command.
In the third terminal window run the next command.
In the fourth terminal window run the same command again.
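The exact commands belong to the linked repository; what follows is only a plausible reconstruction, shown as if everything runs in one environment. The image tag, binary names, and the explicit iox-roudi step are assumptions:

```sh
# Build the docker image from the repository root (tag name is made up).
docker build -t cyclonedds_iceoryx_memory_leak .

# Terminal 1: the iceoryx daemon (RouDi) has to run for shared-memory transport.
iox-roudi

# Terminal 2: the slow subscriber application.
./test_subscriber

# Terminal 3: a publisher.
./test_publisher

# Terminal 4: a second publisher in parallel (same command as in terminal 3).
./test_publisher
```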
And wait a minute.
After some time you will most likely start to get TOO_MANY_CHUNKS_HELD_IN_PARALLEL errors from Iceoryx.
Expected behavior
Messages which cannot be processed due to a slow subscriber are dropped silently.
Actual behavior
Messages which cannot be processed due to a slow subscriber are dropped silently for a few seconds and then Iceoryx errors start to appear.
Additional information
In this example a really slow subscriber has a QoS history depth (250) slightly smaller than the maximum history depth available in the precompiled ros-humble-iceoryx-* packages (256). When the subscriber's queue fills up it should stay at a constant size, and that is mostly the case if there is a single very fast publisher. If there are multiple parallel publishers it starts to slowly leak, which can be observed with the iox-introspection-client (if it is compiled separately).
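As a concrete reference point, here is a hypothetical rclcpp subscriber along the lines described above. The topic name, message type, and the artificial delay are assumptions, not taken from the reproduction repo:

```cpp
#include <chrono>
#include <thread>
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("slow_subscriber");

  // History depth 250: just below the 256-chunk limit of the precompiled
  // ros-humble-iceoryx-* packages mentioned above.
  auto qos = rclcpp::QoS(rclcpp::KeepLast(250));

  auto sub = node->create_subscription<std_msgs::msg::String>(
    "test_topic", qos,
    [](std_msgs::msg::String::ConstSharedPtr) {
      // Artificially slow handler so the queue stays full.
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```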
The issue is easily reproducible if publishers abort execution abruptly (this method is used in the example repository); however, AFAIK that is not a requirement for the issue to occur. We first noticed the leaking chunks in our system, which has a few dozen nodes, and then tried to find an easily reproducible, simple case.
For more background, we found this issue due to another possible bug: https://github.com/ros2/geometry2/pull/630. That bug makes tf_buffer a really slow reader of the /parameter_events topic. This topic has a QoS history depth of 1000, so it cannot even be handled with the default Iceoryx limits. We recompiled Iceoryx with a history depth of 4096 and the system seemed to work fine for a few hours, but then we started to get errors that too many chunks were held in parallel on the /parameter_events topic, which didn't make sense. We then observed with the iox-introspection-client that if you start some simple node with the default parameter handling, and next start and close other, unrelated nodes that broadcast their parameters in parallel, the number of memory chunks held by the first node slowly and randomly increases over time.
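For completeness, a hedged sketch of the kind of rebuild described above. The CMake option names are quoted from memory, not from this report, and should be verified against the iceoryx version in use:

```sh
# In a workspace that contains the iceoryx sources: rebuild with larger limits.
# The option names are an assumption; check iceoryx_posh_deployment.cmake for
# the authoritative list of deployment switches.
colcon build --packages-select iceoryx_posh --cmake-args \
  -DIOX_MAX_SUBSCRIBER_QUEUE_CAPACITY=4096 \
  -DIOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY=4096
```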