basicNew opened this issue 6 years ago
Alright, we have an update on this end.
I had to take Python out of the way (I had issues forcing CMake and Bazel to use a custom-built interpreter for profiling), so I wrote a couple of demo replicas (namely, `gazoo` and `dragway`) in pure C++ (see branch for reference). Then I spent a fair amount of time fighting different profilers, each with its own pros and cons:

- `callgrind`, a `valgrind` plugin, gives one a fair idea of CPU costs, but not without huge performance penalties and disrupting normal thread execution (which gets serialized).
- `google-perftools` performs much better than `valgrind`, but its CPU profiler is way too coarse. Its heap profiler is helpful though.
- `gprof` would've been great to get a better sense of total call counts (we even have a custom `CMAKE_BUILD_TYPE`, `Profile`, in both `ignition` and `delphyne` for this!), but it doesn't play well with dynamically linked executables :(. So far it has failed to see symbols outside the executable itself.

Interestingly enough, I'm under the impression that the main performance issue comes from `Drake`'s framework itself.
Every time a `System` evaluates an input port, the current implementation queries the output port it is connected to, which in turn may trigger further evaluations, effectively traversing the relevant branches of the `Diagram` the system is part of. That we know. But the thing is, each and every query involves a call to the associated `Calc...` function of the port; there's no caching of the query result yet. Thus, depending on the wiring, you may end up recomputing the same values many, many times [1, 2], which is bad when doing so is expensive (as it is for e.g. the `MobilPlanner` system used by our `MobilCar` agent). On top of this, because output port value calculation is usually preceded by allocation, you may also hit the heap way more than necessary [2].
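To make this concrete, here's a minimal, Drake-free sketch (all names hypothetical) of what an uncached pull evaluation amounts to: every query allocates a fresh result and re-runs the calculator, so a downstream system that evaluates its input k times per step pays for k full upstream recomputations.

```cpp
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in, not Drake's actual classes: an output "port" whose
// value is reallocated and recomputed on every single query.
struct UncachedOutputPort {
  std::function<std::vector<double>()> calc;  // the expensive Calc... function

  std::unique_ptr<std::vector<double>> Eval() const {
    auto value = std::make_unique<std::vector<double>>();  // allocation first
    *value = calc();                                        // then computation
    return value;  // nothing is remembered for the next query
  }
};
```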
The good thing is that there's ongoing work to improve this; I was referred to https://github.com/RobotLocomotion/drake/pull/7668 when I asked about it on Slack.
[1] `PoseAggregator` triggering millions of calls (valgrind, callgrind plugin)

[2] `Eigen` hitting the heap, hard (google-perftools, heap profiler):
```
Total: 5983.4 MB
3546.2  59.3%  59.3%  3546.2  59.3%  Eigen::internal::aligned_malloc
1787.2  29.9%  89.1%  1787.2  29.9%  drake::multibody::collision::BulletModel::CollidingPoints
 200.4   3.3%  92.5%   200.5   3.4%  drake::systems::rendering::FrameVelocity::FrameVelocity
 122.1   2.0%  94.5%   122.1   2.0%  drake::systems::LeafCompositeEventCollection::LeafCompositeEventCollection
  39.9   0.7%  95.2%    39.9   0.7%  drake::systems::DiagramCompositeEventCollection::DiagramCompositeEventCollection
  39.5   0.7%  95.9%  3585.2  59.9%  std::vector::_M_emplace_back_aux
  35.7   0.6%  96.4%    37.5   0.6%  drake::hash_append
  31.3   0.5%  97.0%   159.4   2.7%  drake::automotive::PoseSelector::FindClosestPair
  21.6   0.4%  97.3%   143.3   2.4%  drake::systems::LeafSystem::AllocateCompositeEventCollection
  15.3   0.3%  97.6%    16.1   0.3%  drake::systems::rendering::PoseBundle::PoseBundle
  14.4   0.2%  97.8%    14.4   0.2%  Eigen::internal::conditional_aligned_new_auto
  12.4   0.2%  98.0%    12.4   0.2%  std::__cxx11::basic_string::_M_assign
   7.8   0.1%  98.2%     7.8   0.1%  Eigen::internal::call_dense_assignment_loop
   7.3   0.1%  98.3%     7.6   0.1%  drake::automotive::CarVisApplicator::get_load_robot_message
```
The first column on the left is the total amount of memory, in MB, allocated by the named function during program execution (not to be confused with the total amount of memory in use, which is total allocated minus total freed so far).
Super interesting @hidmic! Now, can we implement a hardcoded, super ad-hoc caching mechanism (e.g. in `PoseAggregator`) to see if we can confirm your hypothesis? I would hate to wait until the Drake PRs are merged only to find out that they don't fix our issue.
@basicNew Hmm, which hypothesis do you want to validate?
We can confirm these re-computations do take place with a dummy diagram like the one below, checking that evaluating E's input once triggers B's output computation twice:
```
A +-------> C +---------+
            ^           v
            |           E
            |           ^
B +---------+----> D +--+
```
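For what it's worth, the double computation is easy to reproduce with a toy, Drake-free model of that diagram (hypothetical types, just to illustrate the pull semantics):

```cpp
#include <functional>
#include <iostream>

// Toy pull-based evaluation (not Drake's API): every node recomputes its
// upstream dependencies on each query, mirroring the lack of caching above.
struct Node {
  std::function<double()> calc;
  double Eval() const { return calc(); }
};

int main() {
  int b_calls = 0;
  Node a{[] { return 1.0; }};
  Node b{[&b_calls] { ++b_calls; return 2.0; }};
  Node c{[&] { return a.Eval() + b.Eval(); }};  // C depends on A and B
  Node d{[&] { return b.Eval() * 3.0; }};       // D depends on B
  Node e{[&] { return c.Eval() + d.Eval(); }};  // E depends on C and D

  e.Eval();  // evaluate E's "input" once...
  std::cout << "B computed " << b_calls << " times\n";  // ...prints 2
  return 0;
}
```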
It's (way) trickier to validate that these duplicated computations are the (sole?) root cause for MOBIL cars' degraded performance though.
Doing some very coarse caching in the `PoseAggregator`, as you suggest, reduces CPU load but makes it hungrier for RAM (in total allocated memory, that is; I presume because of the extra copy of the bundle, since lots of heap-allocated `Eigen` matrices are being moved around). To put it in numbers, compare the total reduction in callee call counts. Also, having 60 MOBIL cars distributed on a 4-lane dragway makes `/visualizer/scene_updates` topic messages go out at ~20 Hz (realtime) without the change and at ~40 Hz (realtime) with the change.
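For reference, the rough shape of that ad-hoc cache (hypothetical names, not the actual `PoseAggregator` change) also shows where the extra allocations come from:

```cpp
#include <functional>
#include <optional>
#include <vector>

// Toy stand-in for the pose bundle: a large heap-backed container.
using Bundle = std::vector<double>;

// Very coarse cache keyed on simulation time: recompute only when time
// advances, but copy the whole bundle out on every query -- hence more
// total allocated memory even as CPU load drops.
class CoarseBundleCache {
 public:
  Bundle Eval(double time, const std::function<Bundle()>& calc) {
    if (!time_ || *time_ != time) {
      value_ = calc();  // expensive aggregation, now done once per step
      time_ = time;
    }
    return value_;  // copy of heap-allocated data on every hit
  }

 private:
  std::optional<double> time_;
  Bundle value_;
};
```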
Still, cache support isn't reaching each MOBIL car (sub)diagram (and those are quite convoluted, so I'd imagine they suffer from this as well).
Finally, taking a look below `automotive`, I would not rule out that parts of the implementation may not scale that well as more cars get added to the world (e.g. see the `PoseSelector` system implementation). But that's just a hunch; I have not profiled nor studied the complexity of each component individually.
Yes, my point here was to do some very rough approximation to see if, by caching, we were getting any of the expected benefits; doubling the update frequency goes in a promising direction. So, how do you suggest we move forward with this?
@basicNew What if we reach out to the author of https://github.com/RobotLocomotion/drake/pull/7668 to see how we can give this new caching support a shot (locally, I mean)? If it works[*], I wouldn't invest more time on this (at least for M3) and I'd push/wait for the feature to get released.
If it doesn't work (i.e. there are more bottlenecks, but in the systems themselves), we'll have to benchmark each of the (relevant) `automotive` systems in isolation. I'm sure there are aspects of the implementation and/or the algorithms that could use improvement. It's a bigger task though.
[*] BTW, I see that we have a goal of at least 20 agents @ 100 Hz, but I think we're lacking a target host. What kind of hardware are we expecting Delphyne users to have?
Avidly watching these performance threads. Certainly would like to learn how exactly Drake systems is doing its computation. Can you jot down notes as you discover things @hidmic? I've found precious little other documentation (design notes or user documentation) around. Have resorted to just learning by discovery over the last couple of weeks.
Realistically, I think we should be able to nail that target on a decent laptop (for the backend at least). It really shouldn't be a thrashing machine - we used to run a far more intensive vision SLAM on a very low-spec'd ARM core.
If we're going to seriously work on the systems, might also be time to just port them all in from `drake::automotive` so we can work on them rapidly here. I started doing that for the rail follower in branch stonier/rail_follower. Can work out what to do with them later once we've done surgery on them.
@hidmic I think reaching out to Sherm would be a good thing to do, both to understand ETAs and to let him know we are also interested in these improvements. Also, it would be great if we could get a branch to start benchmarking MOBIL cars with the system cache on.
Another place where we could start digging is maliput queries (we talked about this a couple of weeks ago). I think the dragway implementation should be pretty straightforward, so I'm not expecting much penalty there, but it's definitely worth checking (a different story will be multilane, but we can deal with that once we get there).
Finally, I like the idea of looking inside the MOBIL car to understand what the major bottlenecks are; e.g. a MOBIL car uses a `MobilPlanner`, an `IdmController`, a `PurePursuitController`, etc. We could start by comparing them and checking if something stands out.
@basicNew @stonier I spoke to Michael Sherman, the author of https://github.com/RobotLocomotion/drake/pull/7668 (tracked by https://github.com/RobotLocomotion/drake/issues/4364). The ETA was (roughly) a couple of weeks for the entire caching support to be merged into master. The latest related PR is https://github.com/RobotLocomotion/drake/pull/9067. He also gave me the green light to pull his feature branch and test. Which I've done.
Again, having 60 MOBIL cars distributed on a 4-lane dragway makes `/visualizer/scene_updates` topic messages go out at ~20 Hz (realtime) with caching disabled and at ~60 Hz (realtime) with caching enabled.
Now most of the time seems to be spent within `MobilPlanner::CalcLaneDirection`, specifically when it makes calls to the `PoseSelector` class.
I'll see if I can improve it or at least pinpoint the main issue.
Great, thanks for the heads-up @hidmic! 3x with caching and a single method consuming that much is a very promising start! It would be great to see what we can change in `CalcLaneDirection` to further improve things.
@basicNew @stonier Ok, confirmed. `PoseSelector`'s implementation is another major bottleneck. I did a quick local hack of the main loop inside `PoseSelector::FindSingleClosestInDefaultPath()`. Namely, `traffic_isometry` can be a reference, and pasting the `IsWithinLane()` code into the loop removes an extra `ToLanePosition()` call and the need to repeatedly query the `RoadGeometry` linear tolerance.
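In spirit, the hack boils down to something like this (a generic, self-contained paraphrase, not the actual Drake code):

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Take poses by reference instead of copying them, and hoist the invariant
// tolerance query (previously hidden inside a per-iteration IsWithinLane()
// style helper) out of the loop.
struct Pose { double s{0.0}; double r{0.0}; };     // stand-in for Isometry3
struct Road { double linear_tolerance() const { return 1e-3; } };

int FindClosestAhead(const std::vector<Pose>& traffic, const Road& road,
                     double ego_s, double lane_half_width) {
  const double tolerance = road.linear_tolerance();  // hoisted out of the loop
  int closest = -1;
  double closest_distance = std::numeric_limits<double>::infinity();
  for (int i = 0; i < static_cast<int>(traffic.size()); ++i) {
    const Pose& pose = traffic[i];                   // reference, no copy
    // "Within lane" check inlined, reusing the already-computed lane
    // coordinates instead of doing a second lookup.
    if (std::abs(pose.r) > lane_half_width + tolerance) continue;
    const double distance = pose.s - ego_s;
    if (distance > 0.0 && distance < closest_distance) {
      closest_distance = distance;
      closest = i;
    }
  }
  return closest;
}
```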
With just that, we get messages at ~75 Hz with the same 60 MOBIL cars in the same 4-lane dragway. Looking through the code, there are many places where queries are duplicated for the sake of code clarity and locality.
Incidentally, while trying to speed up `valgrind`, I replaced all Prius meshes with boxes. Message frequency got close to ~90 Hz. The `CarVisApplicator` may be another (medium-sized) player.
It remains to be seen how this winds up, but evidence seems to suggest that we are pushing the limits of `automotive`'s implementation (if not Drake's). In the short term, we can do some surgery on `PoseSelector`'s implementation. But moving forward, I think it'd be helpful to have a clear goal so we know how deep to go in our optimization crusade :).
Glad to see caching was helpful! Thanks for the beautifully-executed experiment, @hidmic.
@basicNew @stonier do we have the green light to start the `PoseSelector` refactor locally? Or do you guys want to wait for official caching support? (BTW, thank you @sherm1 for the feature!)
I would vote for starting the `PoseSelector` refactor now.
@basicNew @stonier Update! I spent quite some time understanding the `PoseSelector` and trying to reduce its overhead (rearranging code, reducing repeated calls, etc.). But so far I've failed to go beyond that last ~75 Hz metric I reported (with caching enabled on Drake's side, of course). Here's the branch with the latest additions below our `systems` directory. The previous branch with C++ demos to profile was rebased on top.
The problem lies in the fact that both the `MobilPlanner` and `IdmController` systems make multiple calls to `PoseSelector` functions (static methods, to be more accurate), which in turn perform exhaustive searches (e.g. every car goes through all other cars checking which one is ahead of it, on every single step).
So to get that final boost, we need whole new algorithms. We can leverage knowledge of the other cars' pose, velocity and acceleration to skip checks for the ones that won't be close enough soon (assuming cars don't teleport, which might break on e.g. dragways, which do have such a device).
Let me know if we have the green light to start researching the topic.
Hmm, interesting. Some time ago the idea of a "swarm" controller (as opposed to single-car controllers) was brought up by @stonier as a way to be more efficient. Maybe we can do a time-bounded thing (2 days tops) to do some research in this direction, using this particular computation as a concrete example of performance optimization?
On the other hand, with current optimizations I think we are ok for the demos we want to run, so I would hate to keep investing time on something that is not required for this milestone.
What do you think @stonier ?
Sorry for the delay, missed the notification from the first question last week.
Matt/Jon had mentioned something like what you describe, but neither had done any digging. What are the exhaustive searches that are done in turn?
@basicNew it wasn't exactly a swarm controller, though that is an option. I was thinking about simulator-level systems to expedite calculations on data from the other agent systems + pose aggregator. Does this make sense here? Or would it be redundant if there were no MOBIL cars?
With regards to decision making, I think we should go forward with pulling the agent systems/states/params and pose selector into delphyne (in the spirit of #498 and PR 499). We are going to have to dissect those pretty hard anyway. Once they're in, we can at least put proposals in a doc - that would be good to have while it's fresh in everybody's mind, regardless of whether we do them for M3 or not.
@stonier Sorry too, I did not give you guys any pointers to the actual code. I'll use our version of these systems, as I've removed silly duplications and weird data structures (e.g. `LaneDirection`) there already.
The `MobilPlanner` system queries the current and side lanes (if any). On each query, it scans behind and ahead of the ego car. For each scan direction, it follows the default lane path (using default ongoing branches or the first found, see here), looking for the closest traffic car in that path, essentially querying them all. That happens on every simulation step.
The `IdmController` system only scans ahead in the current lane, but it's currently looking for branches too, so from each traffic car position (which is in turn found by a lookup across the entire road geometry) it traverses the road back to the ego car to find the closest one. That happens on every simulation step.
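In other words, with n cars every simulation step costs roughly O(n²) lane lookups; schematically (not the actual code):

```cpp
#include <vector>

struct CarState { double s{0.0}; double speed{0.0}; };  // toy traffic state

// Schematic of the per-step cost, not the actual MobilPlanner/IdmController
// code: each of the n controllers scans all n traffic poses on every step,
// i.e. O(n^2) lane-position lookups per step.
void Step(const std::vector<CarState>& cars) {
  for (const CarState& ego : cars) {        // one controller per car
    for (const CarState& traffic : cars) {  // exhaustive scan of traffic
      // ... ToLanePosition()-style lookup and ahead/behind checks here ...
      (void)ego;
      (void)traffic;
    }
  }
}
```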
So far, running a dragway-based demo on top of `valgrind` shows that `PoseSelector` calls from the `MobilPlanner` account for roughly ~30% of the CPU load. Methods like `Lane::ToLanePositionT`, which are rather inexpensive, are called millions of times and thus become relevant. I can only imagine how hard the hit would be on a more complex road (`valgrind`'s performance penalty is so large that it takes ages just to load a multilane road mesh).
@basicNew @stonier Using a swarm controller makes sense only for MOBIL cars right now. We'd still need to figure out how to optimize away redundant computations in that case.
I spent some time thinking about how to overcome these issues in a (relatively) simple way. If we could ban teleportation (modifying the `dragway` implementation for starters), and we had knowledge of maximum velocities and accelerations, then we could narrow the search domain. At a (very) high level, we could:

- Skip traffic cars that cannot come within scan distance for a while (e.g. for roughly `(traffic_distance - scan_distance) / (2 * top_speed)` seconds, to account for the relative speed between the two cars, or using headings, whatever works best).
- And only then perform the kind of lookups that we're using now (reusing as many computations from previous steps as possible).

`PoseSelector` will no longer be stateless.
But as you can see, it's a nontrivial task.
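A rough sketch of the pruning idea (all names hypothetical, assuming we can bound every car's top speed and that nobody teleports):

```cpp
#include <vector>

struct TrackedCar {
  double distance_along_path{0.0};  // from the ego car, along the lane path
  double top_speed{0.0};            // assumed known (or bounded) per car
};

// Hypothetical pre-filter: skip cars that, even closing in at the combined
// top speeds, cannot enter the scan window within `horizon` seconds. Only
// the survivors go through the expensive per-lane lookups we use today.
std::vector<const TrackedCar*> CandidatesWithinHorizon(
    const std::vector<TrackedCar>& traffic, double scan_distance,
    double ego_top_speed, double horizon) {
  std::vector<const TrackedCar*> candidates;
  for (const TrackedCar& car : traffic) {
    const double closing_speed = ego_top_speed + car.top_speed;
    const double time_to_scan_window =
        (car.distance_along_path - scan_distance) / closing_speed;
    if (time_to_scan_window <= horizon) candidates.push_back(&car);
  }
  return candidates;
}
```

With equal top speeds this reduces to the `(traffic_distance - scan_distance) / (2 * top_speed)` bound mentioned above; keeping the candidate set between steps is what would make `PoseSelector` stateful.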
+1 to pull in these classes. Clean ones from Drake or the modified ones in this branch? I can start a document with the above brain dump too.
Design doc opened.
I added some notes in the doc. Maybe a chat on hangout would expedite things?
As for moving over the systems, it's hard to see what you changed in that branch since it's all in one commit (i.e. all new). When bringing them in, just bring them in raw in one commit, then make changes in additional commits, and don't squash the end result so we can trace what we've been doing. Also, for RailFollower, take what's in delphyne now and add the changes on top of that (i.e. LaneDirection etc.).
We can probably do this incrementally over a few PRs. I was actually going to bring in one of the systems tonight till you distracted me with all the updates :)
@stonier Thanks! Replied to all your comments. We can chat whenever you want/can.
Noted. I'll open a PR that brings the classes fresh with no changes in the first commit and then applies all changes in a second commit.
Alright, #501 opened. Also, #503 and ToyotaResearchInstitute/delphyne-gui#140 introduce a small demo that, by default, puts 20 MOBIL cars on common roads. The idea is to use it from now on to check where we are in terms of performance and also to get a grasp of the impact hardware variations have (at least among us Delphyne devs). If you guys can run it and report stats back, that'd be great.
Also, based on a brief discussion I had w/ @basicNew, I went on to try the `gazoo` demo too, with the same 60 MOBIL cars, though not on top of `valgrind` (as I said before, the performance penalty is insane, even at the road geometry mesh generation step). Results were less encouraging, though not entirely unexpected:
| Computation Policy | Build Type | Scene Update Frequency |
|---|---|---|
| prefer-accuracy | debug | ~7 Hz |
| prefer-accuracy | release | ~10 Hz |
| prefer-speed | debug | ~14 Hz |
| prefer-speed | release | ~20 Hz |
Caching was enabled on Drake's side. This reinforces the fact that Multilane is slow if always accurate (FYI, solutions for this are in the works).
@basicNew @stonier Given everything we've learned above, I'd say we can close this issue. We can open up another issue to track MOBIL car system improvements. What do you think?
I would prefer to keep this one open until we close all the other issues.
About the perf demo (+100 by the way). What are your stats on that? (i.e. dragway vs linear multilane vs arc multilane)
Heads-up re caching: Drake PR #9113 is in review. When it lands (likely next week), Drake master will have caching, but it will be disabled by default; `context.EnableCaching()` turns it on. A subsequent PR will change the default.
That's awesome @sherm1! Thank you for the FYI.
@basicNew sorry for the delay in replying, I had to rebuild everything for Release from scratch, using Drake caching plus the latest MOBIL classes refactor. My stats are shown below:
For a dragway.
```
michel@abydos:~/Workspaces/delphyne_ws$ delphyne-mobil-perf dragway
Running simulation for 5.0 seconds.
Simulation ended. I'm happy, you should be too.
= Simulation stats ==========================
Simulation runs: 1
Simulation steps: 5000
Elapsed simulation time: 5.0s
Elapsed real time: 7.580936824s
=============================================
```
For a straight Multilane road.
```
michel@abydos:~/Workspaces/delphyne_ws$ delphyne-mobil-perf straight_lanes
Running simulation for 5.0 seconds.
Simulation ended. I'm happy, you should be too.
= Simulation stats ==========================
Simulation runs: 1
Simulation steps: 5000
Elapsed simulation time: 5.0s
Elapsed real time: 12.632072546s
=============================================
```
For a curved (arc) Multilane road.
```
michel@abydos:~/Workspaces/delphyne_ws$ delphyne-mobil-perf curved_lanes
Running simulation for 5.0 seconds.
Simulation ended. I'm happy, you should be too.
= Simulation stats ==========================
Simulation runs: 1
Simulation steps: 5000
Elapsed simulation time: 5.0s
Elapsed real time: 17.910087965s
=============================================
```
Great, thanks for the data @hidmic! I would love to see those numbers again once we land the performance improvements on multilane (in particular how `straight_lanes` compares to `dragway`).
@hidmic -- RobotLocomotion/drake#9113 landed, so caching is in Drake master now. As mentioned above it is off by default, so you have to write `context.EnableCaching()` to turn it on.
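For reference, turning it on amounts to something like this (a minimal sketch; `diagram` stands for whatever `Diagram` the simulation already builds):

```cpp
#include <memory>

#include <drake/systems/analysis/simulator.h>
#include <drake/systems/framework/diagram.h>

// Minimal sketch: the only new bit is the EnableCaching() call on the
// simulator's root context, since caching is off by default for now.
void RunWithCaching(std::unique_ptr<drake::systems::Diagram<double>> diagram) {
  drake::systems::Simulator<double> simulator(*diagram);
  simulator.get_mutable_context().EnableCaching();
  simulator.Initialize();
  simulator.StepTo(5.0);  // AdvanceTo() in newer Drake versions
}
```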
Further improvements to the MOBIL car are on hold. Anything we do right now will only be patching, when what we really need to consider is a better ado car management architecture.
@agalbachicar I'll un-assign myself for now. Pull me back in if need be!
(Coming from #455)
Profile the MOBIL car and all its pieces, trying to find possible bottlenecks. We are only able to have 20 MOBIL cars simulating @ 100 Hz on my machine, so it would be desirable to bump that number up.