ros2 / rclpy

rclpy (ROS Client Library for Python)
Apache License 2.0
289 stars 224 forks source link

Rate object's destroy() does not destroy the underlying Timer object, leading to CPU usage explosion #1278

Closed 2lian closed 3 months ago

2lian commented 4 months ago

Bug report

Steps to reproduce issue

Create a Node with rclpy. In a callback: create a Rate, wait it then destroy it. The more you call this callback the more the cpu usage will increase. CPU usage I report come from $ top

The following code maxes out my CPU in under a second, even though it does nothing but wait. (This code is exaggerated to create the issues, it takes 5-10 minutes to max my robot's cpu)

import rclpy
from rclpy.node import Node

class MoverNode(Node):
    def __init__(self):
        super().__init__("node")
        self.startup_timer = self.create_timer(
            timer_period_sec=1/100,
            callback=self.tmr_callback,
        )

    def tmr_callback(self) -> None:
        # self.get_logger().warn("tik")
        wait = self.create_rate(1000)
        wait.sleep()
        wait.destroy()

def main(args=None):
    rclpy.init()
    node = MoverNode()
    executor = rclpy.executors.MultiThreadedExecutor() 
    executor.add_node(node)
    executor.spin()
    node.destroy_node()
    rclpy.shutdown()

if __name__ == "__main__":
    main()

Expected behavior

This should not use all the cpu. The rates should be properly destroyed when destroy() is called. As a proof, here are alternative implementations without Rate that only use 7% of CPU:

    def tmr_callback(self) -> None:
        time.sleep(1/1000)
    def tmr_callback(self) -> None:
        self.get_clock().sleep_for(Duration(seconds=1/1000))

Actual behavior

The CPU usage explodes and ros becomes laggy. I found that Rate._timer are not properly destroyed. The code below proves that the timer's destruction is the issue by fixing the problem (resulting in 7% CPU load).

    def tmr_callback(self) -> None:
        wait = self.create_rate(1000)
        wait.sleep()
        wait._timer.destroy()
        wait.destroy()

Right now the user should call Rate._timer.destroy() and Rate.destroy() to properly use/destroy a Rate. Which is very confusing.

Additional information

In rclpy/timer.py, line 142 the following code would fix the issue, and properly destroy the Timer.

class Rate:
    ...
    def destroy(self):
        self._timer.destroy()  # new line
        self._is_destroyed = True
        self._event.set()

Let me know if I should do a pull request for this.

2lian commented 4 months ago

I do also wonder if my use of the Rate is wrong. Because this seems like a obvious bug.

clalancette commented 4 months ago

In rclpy/timer.py, line 142 the following code would fix the issue, and properly destroy the Timer.

Well, it definitely should not do that. The timer as passed in there is not owned by the Rate object, it is owned by whatever is creating that (in this case, the Node is the one that calls create_rate). Instead, the user should call node.destroy_rate(rate) to properly destroy a previously created Rate object. Does that work for you?

2lian commented 4 months ago

Ok I understood how to use it now thank you. I did the same benchmark with the code below. Conclusion: Yes this works as intended and the CPU usage stays low.

def tmr_callback(self) -> None:
    myRate = self.create_rate(1000)
    myRate.sleep()
    self.destroy_rate(myRate)

May I suggest to adapt the documentation to avoid users not properly destroying Rates in the future? On the internet the documentation is lacking, and I usually work by opening the objects with my LSP, then having a quick look of the code available inside. Node.create_rate() and Rate both never mention the existence of Node.destroy_rate(), but Rate.destroy() is here in plain sight.

Suggestions:

  1. Add """This function should be handled by the node. See Node.destroy_rate(Rate)""" in the Rate.destroy() method's doc-string.
  2. Add """Use Node.destroy_rate(Rate) to destroy the Rate""" in Node.create_rate() doc-string.
  3. (Not so good: Change Rate.destroy() to Rate._destroy(). I know it is not really a private method of Rate, but the user should not use it. It is fairly confusing to have public methods that should not be used, alongside other that should be used like Rate.sleep().)

Maybe those additions should also be done on objects that are handled the same way: Publisher Subscription Client Service Timer GuardCondition?

clalancette commented 4 months ago

Please feel free to open a PR with your 1) and 2) suggestions. I am happy to review that.

2lian commented 4 months ago

Ok got it. I will make the corresponding PR soon. Should I also commit 1) and 2) for similar Publisher Sub etc. objects? Or only Rate?

Thanks a lot for you help.

clalancette commented 4 months ago

Ok got it. I will make the corresponding PR soon. Should I also commit 1) and 2) for similar Publisher Sub etc. objects? Or only Rate?

Up to you. I think it would be nice to have it for the other objects as well, but if you don't have time for that it is fine to just do it for Rate.