ros2 / geometry2

A set of ROS packages for keeping track of coordinate transforms.
BSD 3-Clause "New" or "Revised" License
118 stars 195 forks source link

Fix memory leak in Transform Listener #630

Closed zygfrydw closed 4 months ago

zygfrydw commented 1 year ago

There is a memory / message leak when using TransformListener constructed with internal node.

When TransformListener is constructed in the following way:

auto tf_buffer = tf2_ros::Buffer(node->get_clock());
auto tf_listener = tf2_ros::TransformListener(tf_buffer);

There is a memory leake due to messages published to /parameter_events which is not served in the internal node. We have observed it when using CycloneDDS with shared memory (Iceoryx) and we have created a repo to reproduce this error.

The reason for this issue is that, when in TransformListener is created dedicated thread, the subcriptions for /tf and /tf_static topic are added to callback group. When internal node is created, the node creates subcription on topic /parameter_events. However this subscription is created in default callback group which is never spinned by the executor (the executor is told explicitly to run only the callback group with /tf and /tf_static). The topic /parameter_events is created with QOS with history of 1000 messages. Therefore not served messages live forever in /parameter_events for each TransformListener. In the application with default CycloneDDS this results in small, constant memory leak in each instance of TransformListener, however when messages are taken from pool like when using Shared Memory, the memory pool is exausted quickly.

The /parameter_events is created in rclcpp::TimeSource::NodeState in rclcpp/src/rclcpp/time_source.cpp:270.

In my opinion this subcription on parameter events is not necessary in this internal node, however I do not see any option to disable this. Additionaly, the problem will come back if someone adds other subscriptions in Node in future, hance I perfer current solution with adding default callback group to executor.

Other possible solution is:

executor_ = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
if(optional_default_node_)
{
  // If the optional_default_node is created it is necessary to add the default callback group
  // to ensure that internal node subscriptions like /parameter_events are processed.
  executor_->add_node(node_base_interface_);
} else{
  executor_->add_callback_group(callback_group_, node_base_interface_);
}
clalancette commented 1 year ago

In my opinion this subcription on parameter events is not necessary in this internal node, however I do not see any option to disable this.

So we do actually need to react to parameter events in TimeSource. The reason is so that we can do the right thing when the user enables or disables simulation time via the use_sim_time parameter.

However, we don't exactly need to subscribe to it anymore. Since Iron, we have the ability to get a callback after a parameter has been updated, using the post_set_parameters_callback (like in https://github.com/ros2/demos/blob/36a4c0bd6ae0531ad3096806c41efd7062a1cb31/demo_nodes_cpp/src/parameters/set_parameters_callback.cpp#L86-L106). Additionally, using the post_set_parameters_callback will be more performant than the subscription.

So my suggestion is that instead we open a PR to rclcpp, replacing https://github.com/ros2/rclcpp/blob/9284d7cefa9e37ce3b6926a44e6177d6e4a4b62a/rclcpp/src/rclcpp/time_source.cpp#L279-L288 with an add_on_post_set_parameters_callback. That will solve this leak.

We can also consider taking this PR, for the reasons you describe, but I think that is a separate issue. @zygfrydw are you willing to open that PR to rclcpp?

zygfrydw commented 1 year ago

I am not exactly sure what it changes to replace the parameter_subscription_ with post_set_parameters_callback in terms of this pull request.

In the TimeSource::NodeState::attachNode there is already a subscription to on_set_parameters_callback:

on_set_parameters_callback_ = node_parameters_->add_on_set_parameters_callback(
  std::bind(&TimeSource::NodeState::on_set_parameters, this, std::placeholders::_1));

post_set_parameters_callback = rclcpp::AsyncParametersClient::on_parameter_event(
  node_topics_,
  [this](std::shared_ptr<const rcl_interfaces::msg::ParameterEvent> event) {
    if (node_base_ != nullptr) {
      this->on_parameter_event(event);
    }
  });

I do not understand why both on_set_parameters_callback and parameter_subscription_ are necessary and why changing the parameter_subscription_ to post_set_parameters_callback would change anything.

In my opinion, it is not correct to spin only the /tf and /tf_static methods of the internal Transform Listener node. I am not aware of other elements of node routines that require spinning, so removing post_set_parameters_callback may be sufficient for now. However in my opinion it is a fragile fix because if in the future someone adds some subscription to Node the issue will come back.

We are currently locked on the ROS Humble version so we hope you merge this pull request to ROS Humble as well.

jayyoung commented 11 months ago

Hello, We ran into what looks like this issue and discovered this PR. During our investigation we independently discovered that all of our memory-leaking nodes were ones that instantiated a TransformListener in the way you described in your example. We're testing the fix here now, but I wanted to drop by to show the effects of the memory leak, as we did quite some testing on multiple robots in our fleet, and in simulation, to get a picture of its severity, and what triggers it.

Here is a graph of time vs. memory usage of every node in our system, gathered over 548 minutes (9.1 hours) of robot idle time. The lines with an incline are nodes with TransformListener instantiations that leaked memory. The flat lines are the rest of our system.

results_gh

From eyeballing these results, it looks to me like every TransformListener leaks about ~20mb of memory over that ~9hr period, and nodes with multiple instantiations of a TransformListener leak harder than those with just one, or none. This fits your explanation of a never-decreasing, immortal message queue per-instance of that object. In our case, this overall accumulation led to an extra ~1.2gb of memory usage by our system. This is pretty severe for those of us doing long-term robotics tasks.

Just wanted to put here some experimental support for your diagnosis. Not sure if the fix here helps us yet, but we are very suspicious that this object is the one that is leaky, and we will report back.

nachovizzo commented 10 months ago

I also created a small repository to run experiments. I spawn a dummy node that doesn't do anything, just advertise a parameter which I later use to send a big string payload. After putting the payload into the parameter, I started seeing memory leaks. The confusing part is that with/without TransformListener I still observe leaks, and with 2 different DDS implementations.

Indeed if a tf2_ros::TransformListener object is part of the nodes the memory leak is notable

LeakyNode - No tf stuff

class LeakyNode : public rclcpp::Node {
public:
    explicit LeakyNode(const std::string &name) : Node(name) {
        this->declare_parameter("my_parameter", "world");
    }
};

leaky_node

LeakyNode - with dummy TransformListener

class LeakyNode : public rclcpp::Node {
public:
    explicit LeakyNode(const std::string &name) : Node(name) {
        this->declare_parameter("my_parameter", "world");
        tf_buffer = std::make_unique<tf2_ros::Buffer>(this->get_clock());
        tf_listener = std::make_unique<tf2_ros::TransformListener>(*tf_buffer);
    }
    std::unique_ptr<tf2_ros::TransformListener> tf_listener;
    std::unique_ptr<tf2_ros::Buffer> tf_buffer;
};

image

Conclusions

I don't have anyone yet :cry: So far all the experiments I have done are in some way contradictory (common in memory leak hunting). Any leads on how to nail down this problem would be highly appreciated

ahcorde commented 4 months ago

@zygfrydw Was this issue fixed with the @nachovizzo PR https://github.com/ros2/geometry2/pull/636 ?

Feel free to reopen it again, if the problem is still there