Closed GoesM closed 1 month ago
That appears to be the case! I think adding a mutex lock should square that up. The laser scan sub is in a different callback group on another executor https://github.com/ros-navigation/navigation2/blob/main/nav2_amcl/src/amcl_node.cpp#L1510-L1513 so they're not synchronized
well, I think there's a better way to fix it but not adding a mutex lock, because of following facts:
Let's notice that:
some dynamic parameter setting could let reinit_laser = true
, and then do following code lines:
However, we could find here's a lack of laser_scan_filter_ .reset()
before initMessageFilters()
.
As we know, the callback functions called by lase_scan_filter_
would access the memory , which would be freed during initMessageFilters()
, so that a use-after-free bug occurs.
similar to z_rand
, I have found many other dynamic parameters could lead to such UAF bug, I'd share them here as tickets for this ISSUE and I changed the title of this ISSUE:
ros2 param set /amcl beam_skip_distance 0.5
ros2 param set /amcl beam_skip_error_threshold 0.9
ros2 param set /amcl beam_skip_threshold 0.3
ros2 param set /amcl lambda_short 0.1
ros2 param set /amcl laser_likelihood_max_dist 2.0
ros2 param set /amcl laser_max_range 100.0
ros2 param set /amcl laser_min_range -1.0
ros2 param set /amcl sigma_hit 0.2
ros2 param set /amcl transform_tolerance 1.0
ros2 param set /amcl z_hit 0.5
ros2 param set /amcl z_max 0.05
ros2 param set /amcl z_rand 0.5
ros2 param set /amcl z_short 0.05
ros2 param set /amcl laser_model_type likelihood_field
ros2 param set /amcl odom_frame_id odom
ros2 param set /amcl scan_topic scan
ros2 param set /amcl do_beamskip False
ros2 param set /amcl max_beams 60
So you're suggesting that we reset the message filter at the start of the dynamic parameter callback and then re-initialize it at the end to kill the extra thread? I think that would work in practice, but I think the mutex is a bit cleaner and in line with what we've done elsewhere in the stack for consistency.
Also constantly creating / destroying the data reader could uncover some DDS vendor implementation issues since that's really exercising the stability of their solution. I don't know that we'd find any bugs, but that would certainly be a good way to trigger them if there was some memory leaks. So in the back of my mind I'd rather not test it if we have an alternative that seems reasonable. The dynamic parameter callbacks are quick so I think this should be fine to mutex lock them :-)
So you're suggesting that we reset the message filter at the start of the dynamic parameter callback and then re-initialize it at the end to kill the extra thread?
Because I'm considering that the current design approach of nav2_amcl seems to work indeed in this way.
After amcl receiving the instruction to change dynamic parameters, the related threads are closed :
and then these threads would be restarted within the new configuration by initMessageFilters();
Additionally, I've had a try on it , by adding the code laser_scan_filter_.reset();
as following:
// Re-initialize the lasers and it's filters
if (reinit_laser) {
lasers_.clear();
lasers_update_.clear();
frame_to_laser_.clear();
laser_scan_connection_.disconnect();
laser_scan_filter_.reset();
laser_scan_sub_.reset();
initMessageFilters();
}
And then these 18 commands for changing dynamic parameters would work correctly.
So in the back of my mind I'd rather not test it if we have an alternative that seems reasonable. The dynamic parameter callbacks are quick so I think this should be fine to mutex lock them :-)
Sorry, I might not fully understand this method. Do you mean changing the current design, removing all the code that closes the threads, and instead using a locking method?
I feel that I need to understand your suggestion more clearly, and then proceed with my future work based on your recommendation. ^_^
On further thought, what you say makes sense, lets do that!
Bug report
Required Info:
Steps to reproduce issue
Launch the navigation2 normally, as following steps:
Learning about how the dynamic-parameter works , I had a try on it .
[notice] whatever the value be (0.5 or anyother double value43) , the UAF occurs all the time.
Expected behavior
no crash occurs
Actual behavior
The ASAN reporting a heap-user-after-free bug to me as following, and the
nav2_amcl
stop its work.Additional information
it seems that the callback function for laser_scan_message is stiil executing while dynamic parameter changes, and they have some data race.