Closed ysl-design closed 2 years ago
Thanks for this detailed error report.
I think the clean solution would be to stop calling processMessage and wait for a running call to finish before destructing the point cloud.
Thank you for your response and for correcting the issues in my description. I compiled RViz with the modifications from PR #1754 and tested it: problem (1) in my description has been fixed. However, problems (2) and (3) still occur during the test. I think that even though it stops subscribing to new data before destruction, threads that have already entered processMessage will keep running after the PointCloudCommon destruction is complete, and I suspect this may be the cause of the crash during mutex destruction.
Yeah, that might be an issue. Quoting the doc: "Attempting to destroy a locked mutex results in undefined behavior." I have pushed another commit to ensure the mutexes are held by the destructor...
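To illustrate the idea discussed above (refuse new calls, then wait for running ones to finish before the members are destroyed), here is a minimal sketch in standard C++. It is not the code from PR #1754: CloudHandler, shutdown(), last() and the in-flight counter are hypothetical names, and the real class uses boost mutexes and ROS callbacks instead.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a class whose processMessage() is called from
// other threads while the owner may decide to destroy it.
class CloudHandler
{
public:
  ~CloudHandler() { shutdown(); }

  // Returns false without doing any work once shutdown has begun.
  bool processMessage(int msg)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (stopping_)
        return false;
      ++in_flight_;  // register this call before doing the work
    }

    last_ = msg;  // placeholder for the real work, done without holding mutex_

    {
      std::lock_guard<std::mutex> lock(mutex_);
      --in_flight_;
      idle_.notify_all();  // notify while holding the lock
    }
    return true;
  }

  // Refuse new calls, then block until every running call has finished.
  void shutdown()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    stopping_ = true;
    idle_.wait(lock, [this] { return in_flight_ == 0; });
  }  // the lock is released here, so mutex_ is unlocked when it is destroyed

  int last() const { return last_; }

private:
  std::mutex mutex_;
  std::condition_variable idle_;
  bool stopping_ = false;
  int in_flight_ = 0;
  std::atomic<int> last_{ 0 };
};
```

This only covers calls that have already entered processMessage. The sketch assumes that whatever delivers the messages is disconnected as well, so that no thread calls into the object after its destructor has returned.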
Thank you for your reply and the new changes, and sorry for not getting back to you sooner. I tried the new modifications, and problem (3) in the description appears to be fixed as well, but problem (2) still exists. Analyzing the core dump file again, problem (2) occurs as follows: after PointCloud2Display is destructed, its parent class MessageFilterDisplay is destroyed, and the crash occurs while executing delete tf_filter_ in ~MessageFilterDisplay(). The backtrace ultimately points to the Signal1 class in the ROS message_filters package: the mutex in Signal1 is destroyed while still locked, resulting in a crash. Can this problem be avoided when MessageFilterDisplay is destructed, or does the ROS code need to be modified?
Thanks for the feedback. The MessageFilter destructor correctly disconnects as expected:
~MessageFilter()
{
  message_connection_.disconnect();
  MessageFilter::clear();
}
Could you try to build with these cmake flags and post the resulting backtrace(s) when just running rviz:
-DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer -O1"
This enables the address sanitizer, which tracks allocated and freed memory in detail.
I built rviz with the cmake flags you gave and, surprisingly, I could not reproduce problem (2) from the description while running rviz. I tried several times and it never crashed. After I shut down rviz normally, the terminal displays "ERROR: LeakSanitizer: detected memory leaks", followed by a lot of output, and I don't know which part to post. The information also doesn't seem very relevant to my question here.
The memory leaks reported are not related to your issue. If you want to report some, only consider those related to rviz. There are many low-level libraries having leaks, which we cannot fix anyway.
That you can't reproduce issue (2) anymore might be related to the slower execution with asan.
I don't understand why the mutex_ in MessageFilter/Signal1 can still become locked after disconnecting the message_connection_ (which should stop pushing new messages) and clearing the message buffer. I was hoping for more insight with asan...
Could you build with -DCMAKE_BUILD_TYPE=RelWithDebInfo again and check which thread is holding the lock at crash time? Please paste the backtrace of this thread (which should include message_filters::Signal1::call()).
I used -DCMAKE_BUILD_TYPE=RelWithDebInfo to build rviz, and again, it no longer crashes. So I'm afraid I can't provide the information you want.
I removed these compilation parameters, rebuilt and ran rviz, and it crashed again. I then added print statements before and after the mutex is locked and unlocked in the Signal1 class. According to the output, after the mutex is locked in the void Signal1::call(const ros::MessageEvent<M const>& event) function, it is not unlocked before rviz crashes. According to the core dump file, the crash occurred during the destruction of the locked mutex.
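For readers who want to repeat this kind of instrumentation, the lock/unlock tracing described here can be sketched as a small RAII helper. This is an illustration in standard C++, not the code that was added to Signal1: TracedLock and traceOneCall are hypothetical names, std::mutex stands in for boost::mutex, and a vector of strings stands in for cout.

```cpp
#include <mutex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: records one line when the mutex is locked and one
// when it is unlocked, each tagged with the address of the mutex.
class TracedLock
{
public:
  TracedLock(std::mutex& mutex, const std::string& where, std::vector<std::string>& log)
    : mutex_(mutex), where_(where), log_(log)
  {
    mutex_.lock();
    log_.push_back(describe("lock"));  // written while holding the mutex
  }

  ~TracedLock()
  {
    log_.push_back(describe("unlock"));  // still holding the mutex here
    mutex_.unlock();
  }

  TracedLock(const TracedLock&) = delete;
  TracedLock& operator=(const TracedLock&) = delete;

private:
  std::string describe(const char* action) const
  {
    std::ostringstream out;
    out << where_ << ' ' << action << ' ' << static_cast<const void*>(&mutex_);
    return out.str();
  }

  std::mutex& mutex_;
  std::string where_;
  std::vector<std::string>& log_;
};

// Traces a single lock/unlock pair around a placeholder critical section.
std::vector<std::string> traceOneCall()
{
  std::mutex mutex;
  std::vector<std::string> log;
  {
    TracedLock lock(mutex, "call", log);
    // ... the work protected by the mutex would go here ...
  }
  return log;
}
```

A "call lock <address>" line without a matching "call unlock" line for the same address means the critical section was still running (or never returned) when the trace ended. Note that the log vector is only protected by the traced mutex itself, so sharing one log between different mutexes would need additional synchronization.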
The following is the cout output. The "call unlock" statement at the end of the Signal1::call function is never printed, indicating that mutex_ is not unlocked.
call lock 0x55ce07f77a68
...
~PointCloudCommon()
~PointCloud2Display() 0x55ce0754d100
~MessageFilter() start
removeCallback lock 0x55ce0754d330
removeCallback unlock 0x55ce0754d330
~MessageFilter() end
rviz: /usr/include/boost/thread/pthread/mutex.hpp:111:boost::mutex::~mutex(): Assertion '!res' failed
Aborted (core dumped)
According to other printed information, mutex_ is locked in the Signal1::call function; the program then runs into helper->call(event, nonconst_force_copy) and crashes.
According to the core dump file, the address of the object when the crash occurs is 0x55ce07f77a68, which is the same as the address of the object that invokes the Signal1::call function in the cout output.
...
#4 0x00007fb805091160 in boost::mutex::~mutex() (this=0x55ce07f77a68, __in_chrg=<optimized out>) at /usr/include/boost/thread/pthread/mutex.hpp:111
...
This should indicate that the crash occurred in the Signal1::call function and that mutex_ was not unlocked.
Thanks for your investigation. I continued as well and traced the issue down to tf2_ros::MessageFilter.
I filed a PR https://github.com/ros/geometry2/pull/538.
I used -DCMAKE_BUILD_TYPE=RelWithDebInfo to build rviz, and again, it no longer crashes.
A release build disables all assertions. Hence, it is not aborting anymore (due to failing assertions).
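As a small illustration of this point: when NDEBUG is defined, as CMake's default Release and RelWithDebInfo flags do for GCC and Clang, assert(expr) expands to nothing and expr is never evaluated. The sketch below shows this with the standard assert macro. The function names are hypothetical, and it assumes that the failing check in boost's mutex.hpp is ultimately controlled by NDEBUG in the same way.

```cpp
// Part 1: with NDEBUG defined, assert() is compiled out entirely.
#ifndef NDEBUG
#define NDEBUG
#endif
#include <cassert>

bool checkedWithNdebug()
{
  bool evaluated = false;
  assert((evaluated = true));  // expands to nothing, the assignment never runs
  return evaluated;
}

// Part 2: <cassert> may be included again; assert is then redefined
// according to the current state of NDEBUG.
#undef NDEBUG
#include <cassert>

bool checkedWithoutNdebug()
{
  bool evaluated = false;
  assert((evaluated = true));  // now the expression is evaluated and checked
  return evaluated;
}
```

Here checkedWithNdebug() returns false and checkedWithoutNdebug() returns true. Presumably the same applies to the crash in this thread: without the assertion the process no longer aborts, but that hides the symptom and does not change the fact that the mutex is destroyed while locked.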
Fixed via #1754
Describe your issue here and explain how to reproduce it.
The description may be a bit long, so please bear with me : )
Your environment
My scenario: Hi, I've added a dozen display plugins to rviz, including pointcloud2, marker, markerArray, etc. I then saved the settings to the xxx.rviz file. Later, I loaded the xxx.rviz file several times and opened the file through ‘File -> Open Config’. (The corresponding data is still being sent when the config file is switched.) I found that rviz occasionally crashed.
I found that there are three causes of the crash, all related to the PointCloudCommon class:
(1) Based on the backtrace and code analysis, an emitTimeSignal signal is sent in the PointCloudCommon::processMessage function. This signal passes a pointer to the pointcloud2 plug-in to the TimePanel::onTimeSignal function. In some cases, the pointcloud2 plug-in is destroyed before TimePanel::onTimeSignal is executed. As a result, the display pointer passed to TimePanel::onTimeSignal becomes invalid, and a segmentation fault occurs when the invalid memory is accessed. I wonder if this can be avoided by adding a check at the top of the TimePanel::onTimeSignal function that determines whether sender() is a null pointer.
the backtrace shows that:
(2) Another possible cause of the crash is that PointCloudCommon has been destructed and the mutex new_clouds_mutex_ has been destroyed, but the lock operation is still performed in the PointCloudCommon::processMessage function, leading to the crash.
the backtrace shows that:
(3) The last possible cause of the crash is that the mutex transformers_mutex_ is still locked in the PointCloudCommon::transformCloud function when PointCloudCommon is destructed, causing the crash.
the backtrace shows that: