Open AlexReimann opened 8 years ago
fyi @stonier
Same with the global planner. I think it makes sense to lock the costmaps when about to read them in the planners, so you don't read them right when they are cleaned (I have seen this happen often), but that should be the responsibility of the planners themselves, done at the right moment to minimize the blocking time.
+1
To a large extent, the default planners all assume a consistent world state throughout a planning cycle. I concur that it probably should be in the planner itself, but that is made slightly difficult because while we can update all the default planners, we don't have the ability to update all the planners out there that may be relying on our costmap locking in move_base.
I would suggest that if we want to change this, we only do it in lunar and above if we do it very soon, or we do it in a later release if we cannot fix it ASAP. That limits the impact of the change on operational systems.
Adding help wanted to note that maintainers do not currently plan to tackle this, but PRs are welcome!
FYI, on the new move_base_flex, we plan to drop this locking and advise plugin developers to adapt their code. As a showcase of why we think this is a good thing: the sbpl_lattice_planner can take an arbitrarily long time to compute a plan (configurable), while it needs only some tens of milliseconds to make its own copy of the global costmap.
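The copy-then-plan pattern described above can be sketched roughly like this (a minimal, self-contained illustration with a stand-in `Costmap` type; the real `costmap_2d::Costmap2D` exposes a `getMutex()` accessor that serves the same purpose):

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Stand-in for a costmap: a cost grid guarded by a mutex.
struct Costmap {
  std::mutex mutex;
  std::vector<std::uint8_t> cells;
};

// Hold the lock only while taking a snapshot (tens of milliseconds),
// then run the potentially unbounded planning on the private copy
// while costmap updates continue in parallel.
std::vector<std::uint8_t> snapshotForPlanning(Costmap& costmap) {
  std::lock_guard<std::mutex> lock(costmap.mutex);
  return costmap.cells;  // deep copy; lock released on return
}
```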
I've changed the title of this ticket to the actionable fix we want to implement
Can I ask for some clarification on the motivation here, please?
Firstly, it's the move_base node which combines perception and planning. Therefore, it is the place that knows whether there is a possibility for conflict on shared resources, and thus it should do the locking. It's a fairly easy mod to make the locking go away completely, for example.
Secondly, I don't see the point. @AlexReimann wrote this "may be impacting performance". What is meant by that? Given that the planner is the only consumer of the costmap, locking caused by the planner itself impacts nobody else.
@corot wrote that the global planner can take very long and that it's cheaper to copy the costmap. Firstly, that depends on the size of the global costmap -- when it's large, copying is a significant cost. We regularly run move_base in environments of 500m x 500m at 10cm resolution. That's about 23MB of data per layer. Moreover, I again wonder what the problem is -- nobody but the global planner uses the costmap, so why worry that it's not being updated? Are you concerned with the CPU spike due to batching the updates?
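For reference, the 23MB figure works out as follows (a quick back-of-the-envelope helper, assuming one byte of cost per cell as in costmap_2d):

```cpp
// Size in MiB of one square costmap layer, one byte per cell.
double layerSizeMiB(double side_m, double resolution_m) {
  const long cells_per_side = static_cast<long>(side_m / resolution_m);
  const long cells = cells_per_side * cells_per_side;  // 5000 * 5000 = 25,000,000
  return cells / (1024.0 * 1024.0);
}
// layerSizeMiB(500.0, 0.1) is roughly 23.8 MiB
```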
Yeah, those are some valid points if you run the all-standard ROS navi stack.
Once you move away from that and e.g. want to use the costmaps from other modules as well, then nope. As far as I know, most companies use a modified version of the navi stack, maybe switching out planners, using different costmap layers, and so on.
Even if you just switch out planners, having the costmap locks in the move_base above the planners makes it impossible to do any smarter locking.
Anyway, for me this is actually not an issue anymore, as we gradually moved away from the standard costmaps. Also, I think I will try to avoid working much with the navi stack in the future ... it really is just good for early prototypes.
Maybe interesting for you: that copy-time issue can be avoided with some kind of double-/multi-buffered costmap using differential updates. Basically you keep a costmap copy and just update it differentially, while the original costmap runs at a higher update rate.
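A minimal sketch of that double-buffered idea (all names hypothetical): the live map is updated at a high rate, and a planner-facing copy is brought up to date by replaying only the cells that changed since the last sync, which is much cheaper than a full copy when few cells changed.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

struct CellUpdate {
  std::size_t index;
  std::uint8_t value;
};

class DoubleBufferedCostmap {
 public:
  explicit DoubleBufferedCostmap(std::size_t n) : live_(n, 0), read_copy_(n, 0) {}

  // Called by the mapping/update thread at a high rate.
  void applyUpdate(const CellUpdate& u) {
    std::lock_guard<std::mutex> lock(mutex_);
    live_[u.index] = u.value;
    pending_.push_back(u);  // remember the diff for the read copy
  }

  // Called by a consumer: replay only the accumulated diffs, then read.
  const std::vector<std::uint8_t>& syncAndRead() {
    std::lock_guard<std::mutex> lock(mutex_);
    for (const CellUpdate& u : pending_) read_copy_[u.index] = u.value;
    pending_.clear();
    return read_copy_;
  }

 private:
  std::mutex mutex_;
  std::vector<std::uint8_t> live_, read_copy_;
  std::vector<CellUpdate> pending_;
};
```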
@AlexReimann Can you say a bit more about which performance issues you're seeing, concretely? It would help us understand what the requirements of those "other modules" are, that require a different locking strategy.
From my side, I would say that locking in a fine-grained manner is not better in general. It is a guiding principle to hold a lock as short as possible, but this is just one of several considerations to take into account. To give just one example, every lock operation is a problem for real-time processing -- you can get preempted on mutex calls.
Regarding that most companies use a modified version of the navstack: That is also what I observe. In fact, several of them (including my own) have completely independent implementations of a navigation architecture.
That said, the standard move_base could still have considerable value for interactions between companies and academia, as well as providing a baseline, so it's not pointless to improve it. Some people have suggested replacements for the overall move_base node, but this issue goes beyond, by also considering the assignment of responsibilities between modules.
Another process is also using the costmap; the global or local planner takes long and thus slows this unrelated process down.
If the locking is planner-specific, it is up to the planner to lock the costmap the whole time or only whenever it is accessing it. Architecture-wise, it would make sense to have locking and non-locking access (depending on the interface you use) and getCopy functions for the costmap, to make everybody happy. But again, that's stuff for a re-write.
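Such an access interface could be sketched like this (hypothetical class and method names, not an existing costmap_2d API): callers choose between an RAII-locked view and a private snapshot.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

class CostmapAccess {
 public:
  // RAII view: holds the lock for its lifetime; cheap, but blocks updates.
  class LockedView {
   public:
    LockedView(std::mutex& m, const std::vector<std::uint8_t>& cells)
        : lock_(m), cells_(cells) {}
    const std::vector<std::uint8_t>& cells() const { return cells_; }
   private:
    std::unique_lock<std::mutex> lock_;
    const std::vector<std::uint8_t>& cells_;
  };

  LockedView lockedView() { return LockedView(mutex_, cells_); }

  // Snapshot: brief lock, then the caller works on its own copy.
  std::vector<std::uint8_t> getCopy() {
    std::lock_guard<std::mutex> lock(mutex_);
    return cells_;
  }

 private:
  std::mutex mutex_;
  std::vector<std::uint8_t> cells_{0, 0, 0};
};
```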
@AlexReimann From my own benchmarking on DWA-derived local planners, the part that takes the longest overall is the scoring -- and this needs read access to the costmap. Also, scoring needs to compare different trajectories, so it mustn't change in between. Fine-grained locking has no benefit here.
For local planners where the trajectory generation step takes longer (e.g., due to use of a more complex robot model), we might see more of a benefit. I don't have any of those available for benchmarking right now, but if that's what you guys do, you could profile it and let us know.
That said, instead of moving locking into the planner, it would also be possible to extend the local planner interface to have stages, where one stage needs costmap access and others maybe don't. This would enable much more parallelism, also for the planning stage, not just for other processes.
An advantage of this would be that it could be done in a backwards-compatible way for everything derived from base_local_planner, because we can add empty methods for these stages (as opposed to moving locks, which would require modification of all the planner implementations).
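The stages idea could look roughly like this (method names are hypothetical, not part of the real nav_core API): new virtual hooks with empty defaults, so planners derived from the old interface keep working unchanged.

```cpp
struct Twist { double vx = 0, vy = 0, wz = 0; };

class LocalPlannerWithStages {
 public:
  virtual ~LocalPlannerWithStages() {}

  // Existing single-shot entry point, kept for compatibility.
  virtual bool computeVelocityCommands(Twist& cmd) = 0;

  // New stages with empty defaults: old planners need no changes.
  // Only the generation/scoring stages would need the costmap lock;
  // selection could run unlocked, in parallel with costmap updates.
  virtual void generateTrajectories() {}
  virtual void scoreTrajectories() {}
  virtual void selectBest(Twist& /*cmd*/) {}
};
```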
Another approach could be to use reader-writer locks, but this wouldn't enable as much parallelism, due to the serialization constraint between costmap updates and planning.
I also don't have anything available for benchmarking anymore. It is a non-issue for us now, but I think it would still be a good improvement (although I am strongly for a full re-write of this thing).
Yeah the stages approach would work for the sake of backwards compatibility.
@AlexReimann will you be at ROSCon? If yes, how about a chat?
Nope, I sadly won't be at ROSCon. Also might tune out of this issue as I will quit the company soonish and have no idea what I will work on next :>
Hi @iluetkeb, I'll be at ROSCon; hope we will have plenty of time to talk about navigation!
For the example about the global planner I mentioned above, believe me, we have VERY big environments to deal with. And making the local, planner-specific copy of the costmap takes negligible time compared to the planning itself.
FYI, and sorry for the off-topic: talking about alternative navigation frameworks, we are presenting our own at ROSCon: https://github.com/magazino/move_base_flex
@corot I guess it depends on your planning architecture. We typically first use topological planning for planning in the large, and then a costly planner (not from SBPL exactly, but close enough) only for relatively short paths, between the nodes of the topological graph. In this situation, planning doesn't take so long, and the overhead of copying the map would be considerable, I surmise.
btw, I'm aware of MBF, and I like the approach of taking the state machine out of move_base and switching to an action-based interface instead. I would go even further and split local and global planning into separate nodes -- putting them into one made sense as long as local and global planning were tightly coupled, but since you've removed that coupling anyway, I don't think it's necessary anymore. But we can talk about that at length at ROSCon ;-)
During the execute cycle, move_base locks the local costmap in the controlling state without actually accessing the costmap.
What's the reasoning behind this?
It seems to be done in case the local planner wants to do something with the local costmap in the computeVelocityCommands function. This seems to be overkill and may be restricting performance.
In our case, computeVelocityCommands sometimes takes a while, but we don't keep the updating of the local costmap locked during that time.
So.. could it be removed?
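For reference, the pattern being questioned looks roughly like this (a paraphrase, not the actual move_base source): the executive holds the costmap lock around the whole computeVelocityCommands() call, so costmap updates stall for as long as the planner runs, even if the planner never touches the costmap.

```cpp
#include <mutex>

// Stand-in for a nav_core local planner plugin.
struct LocalPlanner {
  bool computeVelocityCommands() { return true; }  // may take a long time
};

// Lock is held for the full planner call, blocking costmap updates
// for its entire (possibly long) duration.
bool controllingState(std::mutex& costmap_mutex, LocalPlanner& planner) {
  std::lock_guard<std::mutex> lock(costmap_mutex);
  return planner.computeVelocityCommands();
}
```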