Closed guru-florida closed 3 years ago
Looks like the associated PR has been merged. Can this be closed?
I didn't follow the entire discussion there but sounds like this is an okay fix for now, the more "appropriate" fix being more involved. If needed, a summarizing comment can be added here (and I don't know enough to write that comment), or we can close this and open a followup issue.
Yes, fix as described here was merged. We discussed other options but opted for this simpler fix for now. Agreed, this ticket can be closed.
@mabelzhang @guru-florida thanks for checking and comments, it's been closed.
Bug report
The SEGV is caused due to some sub-iface destructor code expecting living references to the node base or other interface but it was already released earlier in the Node destructor. Memory corruption occurs and eventually malloc checks abort later in the execution, usually somewhere in malloc_consolidate, FastRTPS library or waitfor collection resizing.
The destructor for the Node class is empty. Sure, all the sub-interfaces are shared_ptr and thus their destructors are called but many of these sub-interfaces reference each other. For example, all of the sub-interfaces reference at least the NodeBaseInterface so it must be released last. Furthermore, clock and time_source sub ifaces reference many others so those other dependencies should also stay alive until the clock and time_source is released.
NodeParametersInterface For example, the NodeParametersInterface contains parameter service calls that captures a weak reference to the base interface. During destruction it de-references the weak reference, finds it is null and spits out this error:
NodeParametersInterface at least uses as weakptr to base iface and thus fails gracefully (with a leak) but I expect some of the other sub ifaces may not. I see many examples in my searches of similar crash backtraces as I've encountered, explained below.
Steps to reproduce issue
The nature of these types of mem corruption can be hard to pin down since the debugger/malloc doesnt trip until much later in the execution. I encountered this error while creating and destroying many Nodes at runtime. Specifically, I wrote a node to do episodic parameter tuning/training for a humanoid robot in Gazebo simulation. For each episode, ROS nodes are created then torn-down (cycled). The SEGV would occur somewhere between episode 1 and ~15 and always occur immediately after a tear-down. It got progressively worse as I added more nodes to simulation that are cycled per episode. I would say any such program that continuously creates and destroys nodes will replicate this issue, such as the one in Issue #447 (I will test this and report).
Related Issues
I believe these issues may have encountered the same problem. In particular, the SEGV often happened in FastRTPS but I dont believe the problem is there, it just happens to be the next malloc operation. I encountered a number of crashes in wait_for_work related code.
Fix
In the destructor I added _sub_iface.reset() calls for each of the sub interfaces in reverse order as in the constructor. I was able to get over 500 episodes without crashes. I believe explicitly specifying the order of destruction is a safer way of unravelling the tear-down dependencies.
PR incoming.