Closed GoesM closed 11 months ago
I don't understand what you think is the problem.
waitForCostmap is called once the system is lifecycle activated and processing. All plugins and filters have been initialized and have started processing at this point. There should be no situation where these pointers are invalid.
It does not appear that they are thread locked, however, so perhaps you're referring to memory corruption of enabled_ and current_?
Your report shows that one of the traces passes through on_configure. I see no possible way waitForCostmap is called in that method, or how that could relate to repeated costmap clearing operations.
Thus, your report is unclear to me, and I don't see a potential problem or resolution without more information from you, @GoesM.
We've confirmed that members of layered_costmap_ (such as plugins_ and filters_), as used from nav2_planner/src/planner_server.cpp, are sometimes changed into invalid pointers after the function isCurrent() is referenced, and this always leads to a crash of the program.
bool isCurrent()
{
  return layered_costmap_->isCurrent();
}
Sorry that our question wasn't specific enough. We will check and provide more details. Thanks for replying.
Here's our PR #3958, which describes the bug and the solution more clearly. Please check it : ) @SteveMacenski
Closing ticket, moving discussion to PR which should be merged after you fix some linting / rebase
<1> Bug Description
Bug Type: NullPtr referenced
nav2_planner referenced a NULL-Pointer of nav2_costmap_2d
workstation_environment set
code location
/navigation2/nav2_costmap_2d/src/layered_costmap.cpp
/navigation2/nav2_planner/src/planner_server.cpp
function
isCurrent() from /navigation2/nav2_costmap_2d/src/layered_costmap.cpp
isCurrent() is accessed by /navigation2/nav2_planner/src/planner_server.cpp
<2> References [ log_files ]
More details are provided here.
function calling stack [by Asan report]
planner_server work_log [by ros_node_log]
We have hit the same bug more than 50 times in a single week.
Each time, planner_server hit the bug right after an [INFO] message: "Received request to clear entirely the global_costmap".
The following is a representative planner_server log file from these 50 attempts:
At this line, planner_server shut down suddenly.
<3> Analysis
In what situations does the bug happen?
When executing an instruction [ nav2_goal action sent by the user ], the planner node needs the current costmap result, so it accesses the pointer variable [ costmap_ros_ ] from nav2_costmap_2d.
However, planner_server does not appear to check the pointer before accessing it.
A coincidental collision may cause the bug:
The nav2_costmap_2d node happens to have reached the end of its lifecycle, or is recalculating because of changes in sensor messages (such as odom and scan). At this stage the pointer may have been released. If planner_server happens to use the pointer at that moment to execute an action, the result is a null pointer access.
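To make the scenario above concrete, here is a minimal, self-contained sketch of one conventional way to close such a window: serialise access to the layer list with a mutex and check pointers before dereferencing. Everything in it (FakeLayer, GuardedLayeredCostmap, rebuild, markStale, costmapIsCurrent) is a hypothetical stand-in written for illustration. It is not the nav2 code and not necessarily what PR #3958 does.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a costmap layer; NOT the nav2 Layer class.
struct FakeLayer
{
  bool current = true;
};

// Hypothetical stand-in for a layered costmap whose layer list is mutex-guarded.
class GuardedLayeredCostmap
{
public:
  // Reader side: roughly what the planner thread would call.
  bool isCurrent() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    for (const auto & layer : plugins_) {
      if (!layer || !layer->current) {
        return false;  // a missing or stale layer means "not current"
      }
    }
    return true;
  }

  // Writer side: replace the whole layer list, as a clear/reset might.
  void rebuild(std::size_t count)
  {
    std::vector<std::shared_ptr<FakeLayer>> fresh;
    fresh.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
      fresh.push_back(std::make_shared<FakeLayer>());
    }
    std::lock_guard<std::mutex> lock(mutex_);
    plugins_.swap(fresh);  // readers see either the old list or the new one
  }

  // Mark one layer as stale (out-of-range indices are ignored).
  void markStale(std::size_t index)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    if (index < plugins_.size() && plugins_[index]) {
      plugins_[index]->current = false;
    }
  }

private:
  mutable std::mutex mutex_;
  std::vector<std::shared_ptr<FakeLayer>> plugins_;
};

// Planner-side helper (assumed name): check the pointer before using it.
bool costmapIsCurrent(const std::shared_ptr<GuardedLayeredCostmap> & costmap)
{
  return costmap && costmap->isCurrent();
}
```

With this shape, a caller holding a null or reset pointer gets false instead of a crash, and a reader can never observe the layer list half-way through a rebuild.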
<4> POC design
This seems to be a concurrency problem caused by multithreaded execution. It occurs frequently but depends on coincidental timing, so we cannot provide a POC that succeeds 100% of the time. We can, however, suggest some ways to try to trigger the bug:
Method 1:
At present, it seems that certain situations (such as changes in sensor information or the end of the lifecycle) cause nav2_costmap_2d to perform the operation "clear entirely the global_costmap".
Therefore, one could try the following:
While the program is executing a nav2_goal action normally, send interfering sensor messages so that nav2_costmap_2d constantly clears or recalculates the costmap, and observe whether the same bug occurs.
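The timing dependence behind Method 1 can be illustrated without ROS. The sketch below is a toy model only: one thread repeatedly opens and closes a "clearing" window, standing in for the costmap clear, while the calling thread polls, standing in for the planner's queries, and counts how often a query lands inside the window. The names probeRaceWindow and RaceProbeResult are hypothetical, and no nav2 code is involved.

```cpp
#include <atomic>
#include <thread>

// Counts produced by the probe (hypothetical type, for illustration only).
struct RaceProbeResult
{
  long reads;               // total number of simulated planner queries
  long reads_during_clear;  // queries that landed inside a clearing window
};

// Run clear_cycles simulated clear operations on a second thread while the
// calling thread polls, and report how often the two overlapped.
RaceProbeResult probeRaceWindow(int clear_cycles)
{
  std::atomic<bool> clearing{false};
  std::atomic<bool> done{false};

  std::thread clearer([&clearing, &done, clear_cycles]() {
    for (int i = 0; i < clear_cycles; ++i) {
      clearing.store(true);
      std::this_thread::yield();  // stands in for teardown / rebuild work
      clearing.store(false);
      std::this_thread::yield();
    }
    done.store(true);
  });

  RaceProbeResult result{0, 0};
  do {
    ++result.reads;
    if (clearing.load()) {
      ++result.reads_during_clear;  // an unguarded access here would be unsafe
    }
  } while (!done.load());

  clearer.join();
  return result;
}
```

The overlap count varies from run to run and from machine to machine, which matches the observation that the bug is frequent but not reproducible on demand. Raising the clearing rate, as Method 1 proposes, increases the chance that a query falls inside the window.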
Method 2:
This method only checks for the bug and is not a real POC:
While the program is executing a nav2_goal action normally, try restarting the nodes related to nav2_costmap_2d.