nav2_costmap_2d StaticLayer locking problem

dwisse commented 4 years ago

When the StaticLayer receives an incomingMap or incomingUpdate that changes the dimensions of the costmap during a updateMap in the LayeredCostmap, the updateCosts of the StaticLayer can cause getCost of the StaticLayer to do an out of bounds read of the underlying costmap.

The problem is that the dimension of the costmaps should not change between the updateBounds and the updateCosts calls in the updateMap function of the LayeredCostmap.

To resolve this issue, the incomingMap and incomingUpdate should lock the mutex of the layeredCostmap instead of it's local mutex. Then also the local locks in updateBounds and processMap can be removed

SteveMacenski commented 4 years ago

incomingMap and incomingUpdate should lock the mutex of the layeredCostmap instead of it's local mutex

That doesn't seem right to lock the entire costmap any time a single layer gets some data sources. Shouldn't you just lock the local mutex at the start of updateCosts and unlock at the end of updateMap?

PR appreciated

dwisse commented 4 years ago

If locking the layeredCostmap is too much. It can also be solved by locking the local mutex at the beginning of updateBounds and releasing it at the end of updateCosts in the StaticLayer. And lock the local mutex in the incomingMap/incomingUpdate of the local costmap. incomingMap/incomingUpdate should not be executed between updateBounds and updateCost

The first solution is less complex, but maybe a little bit less efficient than the second solution

For which solution do you prefer a PR?

On Mon, Apr 6, 2020 at 9:31 PM Steve Macenski notifications@github.com wrote:

incomingMap and incomingUpdate should lock the mutex of the layeredCostmap instead of it's local mutex

That doesn't seem right to lock the entire costmap any time a single layer gets some data sources. Shouldn't you just lock the local mutex at the start of updateCosts and unlock at the end of updateMap?

PR appreciated

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ros-planning/navigation2/issues/1625#issuecomment-609992843, or unsubscribe https://github.com/notifications/unsubscribe-auth/APCLV437WKQ322D6RD5WMMLRLIU2HANCNFSM4MCACAFQ .

SteveMacenski commented 4 years ago

The mutex is locked for the 2 callbacks as well as the update methods. I think you just need to lock on the update bounds and unlock on update costs since those are essentially called one after another.

That seems like it can create some serious thread contention, also yours will too, but I think the thought is that this isn't happening all that often (once every few seconds at most in SLAM). Another solution would be to have a costmap / update buffer you build up and then process on the updateBounds call but then you're owning all that compute in the timed loop of costmap updating.

dwisse commented 4 years ago

I've found a problem using the locking in updateBounds and unlocking in updateCosts approach. When updateBounds doesn't detect a change in the costmap, updateCosts is never called, and the mutex is never going to be unlock. Could make a explicit method unlock in the costmap interface, but that doesn't look nice.

On Mon, Apr 6, 2020 at 10:35 PM Steve Macenski notifications@github.com wrote:

The mutex is locked for the 2 callbacks as well as the update methods. I think you just need to lock on the update bounds and unlock on update costs since those are essentially called one after another.

That seems like it can create some serious thread contention, also yours will too, but I think the thought is that this isn't happening all that often (once every few seconds at most in SLAM). Another solution would be to have a costmap / update buffer you build up and then process on the updateBounds call but then you're owning all that compute in the timed loop of costmap updating.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ros-planning/navigation2/issues/1625#issuecomment-610022595, or unsubscribe https://github.com/notifications/unsubscribe-auth/APCLV456YKEHX7SKCTIXUHDRLI4IJANCNFSM4MCACAFQ .

SteveMacenski commented 4 years ago

Perhaps in the layered costmap we can layer->getMutex() and lock them externally? We'd then have to remove all the mutexes from the costmap layers itself to ensure no lock out except by external sources (e.g. sensor, map, data updates).

I don't think there's going to be a single clean solution here since we separated these tasks. Maybe @dlu has a suggestion.

I can say I've never really noticed this even while navigating while mapping, but doesn't mean it shouldn't be resolved. I can see though why its not been looked at.

DLu commented 4 years ago

I might be misunderstanding things, but I also wrote this code many moons ago and am by no means a threading/locking expert.

When the static layers gets a new map of a different size, it calls LayeredCostmap::resizeMap
LayeredCostmap::resizeMap locks the Layered Costmap's mutex
So does LayeredCostmap::updateMap

What you said was

When the StaticLayer receives an incomingMap or incomingUpdate that changes the dimensions of the costmap during a updateMap in the LayeredCostmap, the updateCosts of the StaticLayer can cause getCost of the StaticLayer to do an out of bounds read of the underlying costmap.

First, I don't believe that incomingUpdate can change the dimensions, and if it tries to write outside the current dimensions, it should print an error

So in theory, the resizeMap triggered by the incomingMap AND the updateMap/updateCosts should be protected by the LayeredCostmap's mutex. Is that not working in practice?

SteveMacenski commented 4 years ago

@dwisse can you respond to that?

dwisse commented 4 years ago

Looking at processMap, the global costmap is protected by the LayeredCostmap mutex (resizing) and will cause no problem.But the local costmap (rolling costmap) is not, it does a resize of it's own map.Using the standard mapserver this is not a problem as the resolution and dimensions do not change, but we use a mapserver that scales the map (changes the resolution).It first calculates a route on using a low resolution, and then cuts out a piece of the map on a high resolution and calculate the route againThis is done to handle large maps (2km by 2km).

I've now solved it by using the mutex of layercostmap in incomingUpdate, but it could also be solved by treating a local costmap incomingUpdate the same as a global costmap incomingUpdate (and matchSize) What is your opinion?

On Tue, Apr 28, 2020 at 2:33 AM Steve Macenski notifications@github.com wrote:

@dwisse https://github.com/dwisse can you respond to that?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ros-planning/navigation2/issues/1625#issuecomment-620307305, or unsubscribe https://github.com/notifications/unsubscribe-auth/APCLV45RCXUN7DLR33EMLITROYP3ZANCNFSM4MCACAFQ .

gimait commented 4 years ago

I think the best would be to lock the mutex from the LayeredCostmap in processMap, in this line. Is that what you have done to make it work, @dwisse? I think it might be the cleanest and least error prone solution.

Otherwise, we could use an atomic bool or some kind of fence to protect the layer from changes beween the start of updateBounds and the end of updateCosts avoiding the deadlock that you found using the StaticLayer's mutex, @dwisse. Something like this:

atomic<bool> costmap_available {true};

updateBounds( ... ) {
  costmap_available.store(false);
  // Do bounds stuff..
 return;
} 

updateCosts( ... ) {
  // Do cost stuff..
  costmap_available.store(true);
  return;
}

// I think this should be done in incomingMap, but I'm not sure if it should also be done in incomingUpdate
incomingMap( ... ) {
  while( costmap_available.load() == false ) {
    wait();
  }
  // Do incomingMap stuff..
}

This would allow to reenter updateBounds while protecting the costmap. If we need to do a lot more than this to fix the issue, I would maybe suggest to start working on a new plugin. Then we could design the new plugin specifically to allow dynamic changes in the costmap resolution.

Any thoughts? Please let me know if I'm missing something.

SteveMacenski commented 4 years ago

having mutex contention block the callback from a new map could be really bad. Costmap 2D is single threaded so for the time it takes to update all of them, none of the other layers will get data and would likely spike the thread's utilization while it tries to get access to the mutex.

gimait commented 4 years ago

I agree, using acquiring the mutex from the layered costmap would not be very efficient. But it would fix this issue, I think. And it can be tricky to work with other kind of fences.

Also, using other thread protection would require more code than I posted (I posted the simplest pseudocode I could think). We could also introduce an auxiliary costmap that can be modified safely by the incomingMap and incomingUpdate callbacks, and update the actual costmap used when entering updateBounds, maybe.

dwisse commented 4 years ago

When the costmap dimensions are changed, the mutex is acquired by resizeMap in processMap. The only difference is that it is acquired a little bit earlier. It also doesn''t happen a lot that you change your map.

I changed the line at https://github.com/ros-planning/navigation2/blob/d51a2d3e444f5a393c511f9843903d3be64f8422/nav2_costmap_2d/plugins/static_layer.cpp#L251 into "std::lock_guard guard(*layeredcostmap->getCostmap()->getMutex());"

On Thu, Jun 25, 2020 at 2:47 AM Aitor Miguel Blanco < notifications@github.com> wrote:

I agree, using acquiring the mutex from the layered costmap would not be very efficient. But it would fix this issue, I think. And it can be tricky to work with other kind of fences.

Also, using other thread protection would require more code than I posted (I posted the simplest pseudocode I could think). We could also introduce an auxiliary costmap that can be modified safely by the incomingMap and incomingUpdate callbacks, and update the actual costmap used when entering updateBounds, maybe.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ros-planning/navigation2/issues/1625#issuecomment-649148519, or unsubscribe https://github.com/notifications/unsubscribe-auth/APCLV4ZEVAY26JTC4SNGTX3RYKNCDANCNFSM4MCACAFQ .

SteveMacenski commented 4 years ago

I don't see any other instances of a costmap layer using the master costmap's mutex to stop something else from happening, that seems like a gross misuse of the individual layers' access to the master costmap.

The best case I can think of is we do one of the following:

Add a bool for processing and while processing we cache the occupancy grid and then update at the end of the processing
We update in the middle of processing but have a flag to use some corrected coordinates (haven't fully thought this one out but seems like something we could do)
We lock the mutex in update + processing start that isn't released until the end of updateCosts rather than scope based (mutex contention issues)

dwisse commented 4 years ago

I would choose the first option

If the first option is chosen, i would suggest a double buffer, like screen buffers. Then only a switch of pointers is needed. It will cost for a short amount of time "double" the memory, but without double the time.

The second option seems very complex code wise and computational wise. Scaling, translating and bounds checking

The third option will have the same performance penalty as acquiring the lock of the layered costmap, As in the update cycle of the layered costmap the lock is acquired of the local costmap. (Then it will become a discussion of semantics)

On Thu, Jun 25, 2020 at 8:53 PM Steve Macenski notifications@github.com wrote:

I don't see any other instances of a costmap layer using the master costmap's mutex to stop something else from happening, that seems like a gross misuse of the individual layers' access to the master costmap.

The best case I can think of is we do one of the following:

Add a bool for processing and while processing we cache the occupancy grid and then update at the end of the processing

We update in the middle of processing but have a flag to use some corrected coordinates (haven't fully thought this one out but seems like something we could do)

We lock the mutex in update + processing start that isn't released until the end of updateCosts rather than scope based (mutex contention issues)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ros-planning/navigation2/issues/1625#issuecomment-649757727, or unsubscribe https://github.com/notifications/unsubscribe-auth/APCLV4ZVDCTL6RDLGN2OSCLRYOMLBANCNFSM4MCACAFQ .

gimait commented 4 years ago

I have now created a PR (#1837) addressing this issue.

At the end I avoided using any atomic variables and went directly with using a buffer variable to store incoming maps. Then, in UpdateBounds the map is updated.

Please let me know what you think.

@dwisse, if you have time it would be great if you could test this with your set up and make sure that it fixes the problem, thanks!

SteveMacenski commented 4 years ago

@dwisse please review the new #1837 updates

ros-navigation / navigation2

nav2_costmap_2d StaticLayer locking problem #1625