ksuszka opened 1 year ago
This sounds like an issue with Iceoryx more than an issue with Cyclone itself: we just use the Iceoryx API to publish data and subscribe to it. It would be good to see whether the Iceoryx guys agree with that initial assessment.
I am not sure whom to ping for help, perhaps @elBoberido?
@ksuszka what do you mean by "aborting the execution abruptly"? Is there still a graceful shutdown, or is the application killed?
@eboasson this error happens when the chunks taken from the subscriber are not released. Every time a chunk is taken out of the queue it is internally stored in a fixed-size used-chunk list. When the list is full and there is no space for further tracking, the subscriber immediately releases the chunk and returns the TOO_MANY_CHUNKS_HELD_IN_PARALLEL error. It might be a problem in the rmw implementation, but I'm not quite sure whom to ask.
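For anyone unfamiliar with that mechanism, here is a minimal sketch at the plain iceoryx API level (not the CycloneDDS/rmw code path; the service description, the loop, and the runtime name are made up for illustration):

```cpp
// Minimal sketch of the iceoryx take/release cycle (untyped API).
#include "iceoryx_posh/popo/untyped_subscriber.hpp"
#include "iceoryx_posh/runtime/posh_runtime.hpp"

int main()
{
    iox::runtime::PoshRuntime::initRuntime("chunk-hold-demo");
    iox::popo::UntypedSubscriber subscriber({"Example", "Demo", "Data"});

    while (true)
    {
        subscriber.take()
            .and_then([&](const void* userPayload) {
                // ... process the sample ...
                // Without this call the chunk stays in the subscriber's
                // fixed-size used-chunk list:
                subscriber.release(userPayload);
            })
            .or_else([](auto& result) {
                if (result == iox::popo::ChunkReceiveResult::TOO_MANY_CHUNKS_HELD_IN_PARALLEL)
                {
                    // The used-chunk list is full: iceoryx releases the chunk
                    // immediately and reports this error instead of leaking it.
                }
            });
    }
}
```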
@elBoberido it throws an exception: https://github.com/ksuszka/cyclonedds_iceoryx_memory_leak/blob/d92c07b7c3400bafcfffbc9dd59f62afc638693a/test_publisher/main.cpp#L16
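For readers without the repo at hand, a rough illustration of what "aborting abruptly" means here, i.e. terminating via an uncaught exception instead of a clean shutdown. This is not the linked code; the node name, topic, message type, and loop are invented:

```cpp
#include <stdexcept>
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("test_publisher");
  auto pub = node->create_publisher<std_msgs::msg::String>("test_topic", 10);

  std_msgs::msg::String msg;
  msg.data = "payload";
  for (int i = 0; i < 1000; ++i) {
    pub->publish(msg);
  }

  // No rclcpp::shutdown(): the uncaught exception terminates the process
  // without any graceful teardown of the middleware.
  throw std::runtime_error("publisher aborts abruptly");
}
```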
Writing here just to link the issues ;)
We are also experiencing memory leaks that may or may not be related to CycloneDDS; more information here: https://github.com/ros2/geometry2/pull/630#issuecomment-1808003236
FYI, for us (with @nachovizzo) the root cause was actually with tf2: https://github.com/ros2/geometry2/pull/636
Bug report
Required Info:
Steps to reproduce issue
I prepared a separate repo with two very simple applications (a publisher and a subscriber) to show the issue: https://github.com/ksuszka/cyclonedds_iceoryx_memory_leak/tree/chunks-leak
Using this repo, build the docker image.
Open four terminal windows. In the first terminal window run the first command of the sequence sketched below.
In the second terminal window run the next command.
In the third terminal window run the next command.
In the fourth terminal window run the same command again.
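The exact commands belong to the linked repository; what follows is only a plausible reconstruction, shown as if everything runs in one environment. The image tag, binary names, and the explicit iox-roudi step are assumptions:

```sh
# Build the docker image from the repository root (tag name is made up).
docker build -t cyclonedds_iceoryx_memory_leak .

# Terminal 1: the iceoryx daemon (RouDi) has to run for shared-memory transport.
iox-roudi

# Terminal 2: the slow subscriber application.
./test_subscriber

# Terminal 3: a publisher.
./test_publisher

# Terminal 4: a second publisher in parallel (same command as in terminal 3).
./test_publisher
```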
And wait a minute.
After some time you will most likely start to get TOO_MANY_CHUNKS_HELD_IN_PARALLEL errors from Iceoryx.
Expected behavior
Messages which cannot be processed due to a slow subscriber are dropped silently.
Actual behavior
Messages which cannot be processed due to a slow subscriber are dropped silently for a few seconds and then Iceoryx errors start to appear.
Additional information
In this example a really slow subscriber has a QoS history depth (250) slightly smaller than the maximum history depth available in the precompiled ros-humble-iceoryx-* packages (256). When the subscriber's queue fills up it should stay at a constant size, and that is mostly the case if there is a single very fast publisher. If there are multiple parallel publishers it starts to slowly leak, which can be observed with the iox-introspection-client (if it is compiled separately).
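As a concrete reference point, here is a hypothetical rclcpp subscriber along the lines described above. The topic name, message type, and the artificial delay are assumptions, not taken from the reproduction repo:

```cpp
#include <chrono>
#include <thread>
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("slow_subscriber");

  // History depth 250: just below the 256-chunk limit of the precompiled
  // ros-humble-iceoryx-* packages mentioned above.
  auto qos = rclcpp::QoS(rclcpp::KeepLast(250));

  auto sub = node->create_subscription<std_msgs::msg::String>(
    "test_topic", qos,
    [](std_msgs::msg::String::ConstSharedPtr) {
      // Artificially slow handler so the queue stays full.
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```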
The issue is easily reproducible if publishers abort execution abruptly (this method is used in the example repository); however, AFAIK that is not a requirement for the issue to occur. We first noticed the leaking chunks in our system, which has a few dozen nodes, and then tried to find an easily reproducible, simple case.
For more background, we found this issue due to another possible bug: https://github.com/ros2/geometry2/pull/630. That bug makes tf_buffer a really slow reader of the /parameter_events topic. This topic has a QoS history depth of 1000, so it cannot even be handled with the default Iceoryx limits. We recompiled Iceoryx with a history depth of 4096 and the system seemed to work fine for a few hours, but then we started to get errors that too many chunks were held in parallel on the /parameter_events topic, which didn't make sense. We then observed with the iox-introspection-client that if you start some simple node with the default parameter handling, and next start and close other, unrelated nodes that broadcast their parameters in parallel, the number of memory chunks held by the first node slowly and randomly increases over time.
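For completeness, a hedged sketch of the kind of rebuild described above. The CMake option names are quoted from memory, not from this report, and should be verified against the iceoryx version in use:

```sh
# In a workspace that contains the iceoryx sources: rebuild with larger limits.
# The option names are an assumption; check iceoryx_posh_deployment.cmake for
# the authoritative list of deployment switches.
colcon build --packages-select iceoryx_posh --cmake-args \
  -DIOX_MAX_SUBSCRIBER_QUEUE_CAPACITY=4096 \
  -DIOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY=4096
```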