maliput / delphyne

Scenario and search based Ego/Ado Car traffic simulations
BSD 3-Clause "New" or "Revised" License

Profile MOBIL car #470

Open · basicNew opened 6 years ago

basicNew commented 6 years ago

(Coming from #455)

Profile the MOBIL car and all its pieces to find possible bottlenecks. We are only able to run 20 MOBIL cars simulating @ 100 Hz on my machine, so it would be desirable to bump that number up.

hidmic commented 6 years ago

Alright, we have an update on this end.

I had to take Python out of the way (I had issues forcing CMake and Bazel to use a custom-built interpreter for profiling), so I wrote a couple of demo replicas (namely, gazoo and dragway) in pure C++ (see branch for reference). Then, I spent quite an amount of time fighting different profilers, each with its own pros and cons.

Interestingly enough, I'm under the impression that the main performance issue comes from Drake's framework itself.

Every time a System evaluates an input port, the current implementation queries the output port it is connected to, which in turn may trigger further evaluations, effectively traversing the relevant branches of the Diagram the system is part of. That we know. But the thing is, each and every query involves a call to the associated Calc... function of the port; there's no caching of the query result yet. Thus, depending on the wiring, you may end up recomputing the same values many, many times [1, 2], which is bad when doing so is expensive (as it is for e.g. the MobilPlanner system used by our MobilCar agent). On top of this, because output port value calculation is usually preceded by allocation, you may also hit the heap far more than necessary [2].
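As an aside, here's a minimal sketch (hypothetical class, not Drake's actual API) of why uncached pull-based evaluation blows up call counts: an uncached port re-runs its Calc function on every downstream query, while a cached one runs it once until invalidated.

```cpp
#include <cassert>
#include <functional>
#include <optional>

// Illustrative sketch, NOT Drake's API: an output "port" whose Calc
// callback runs on every downstream query unless a cached value exists.
class OutputPort {
 public:
  explicit OutputPort(std::function<double()> calc) : calc_(std::move(calc)) {}

  // Uncached evaluation: every query re-runs Calc, so a port pulled by
  // several downstream systems recomputes the same value repeatedly.
  double EvalUncached() {
    ++calc_count_;
    return calc_();
  }

  // Cached evaluation: Calc runs once; later queries reuse the stored
  // value until upstream state invalidates it.
  double EvalCached() {
    if (!cache_.has_value()) {
      ++calc_count_;
      cache_ = calc_();
    }
    return *cache_;
  }

  void Invalidate() { cache_.reset(); }
  int calc_count() const { return calc_count_; }

 private:
  std::function<double()> calc_;
  std::optional<double> cache_;
  int calc_count_ = 0;
};
```

Multiply the duplicated Calc calls by the depth of the diagram and the number of simulation steps, and the profile numbers below stop being surprising.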

The good thing is that there's ongoing work to improve this, I was referred to https://github.com/RobotLocomotion/drake/pull/7668 when I asked about it on Slack.

[1] PoseAggregator triggering millions of calls (valgrind, callgrind plugin)

pose_aggregtor_callees

[2] Eigen hitting the heap, hard (google-perftools, heap-profiler).

Total: 5983.4 MB
  3546.2  59.3%  59.3%   3546.2  59.3% Eigen::internal::aligned_malloc
  1787.2  29.9%  89.1%   1787.2  29.9% drake::multibody::collision::BulletModel::CollidingPoints
   200.4   3.3%  92.5%    200.5   3.4% drake::systems::rendering::FrameVelocity::FrameVelocity
   122.1   2.0%  94.5%    122.1   2.0% drake::systems::LeafCompositeEventCollection::LeafCompositeEventCollection
    39.9   0.7%  95.2%     39.9   0.7% drake::systems::DiagramCompositeEventCollection::DiagramCompositeEventCollection
    39.5   0.7%  95.9%   3585.2  59.9% std::vector::_M_emplace_back_aux
    35.7   0.6%  96.4%     37.5   0.6% drake::hash_append
    31.3   0.5%  97.0%    159.4   2.7% drake::automotive::PoseSelector::FindClosestPair
    21.6   0.4%  97.3%    143.3   2.4% drake::systems::LeafSystem::AllocateCompositeEventCollection
    15.3   0.3%  97.6%     16.1   0.3% drake::systems::rendering::PoseBundle::PoseBundle
    14.4   0.2%  97.8%     14.4   0.2% Eigen::internal::conditional_aligned_new_auto
    12.4   0.2%  98.0%     12.4   0.2% std::__cxx11::basic_string::_M_assign
     7.8   0.1%  98.2%      7.8   0.1% Eigen::internal::call_dense_assignment_loop
     7.3   0.1%  98.3%      7.6   0.1% drake::automotive::CarVisApplicator::get_load_robot_message

The first column is the total amount of memory, in MB, allocated by the named function during program execution (not to be confused with the total amount of memory in use, which is the total allocated minus the total freed so far).

basicNew commented 6 years ago

Super interesting, @hidmic! Now, can we implement a hardcoded, super ad-hoc caching mechanism (e.g. in PoseAggregator) to see if we can confirm your hypothesis? I would hate to wait until the Drake PRs are merged only to find out that they don't fix our issue.

hidmic commented 6 years ago

@basicNew Hmm, which hypothesis do you want to validate?

We can confirm these re-computations do take place with a dummy diagram like the one below, checking that evaluating E's input once triggers B's output computation twice:

    A +-------> C+----------+
                ^           v
                |           E
                |           ^
    B +------------->D+-----+
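The double computation can be sketched with plain functions standing in for the systems above (hypothetical names; without caching, evaluating E runs B's calculation twice):

```cpp
#include <cassert>

// Hypothetical pull-based evaluation of the diagram above: evaluating
// E's input pulls C and D, and each of those pulls B, so without
// caching B's output is computed twice per evaluation of E.
struct CalcCounters { int b = 0; };

int EvalA() { return 2; }
int EvalB(CalcCounters& n) { ++n.b; return 1; }
int EvalC(CalcCounters& n) { return EvalA() + EvalB(n); }   // C <- A, B
int EvalD(CalcCounters& n) { return EvalB(n); }             // D <- B
int EvalE(CalcCounters& n) { return EvalC(n) + EvalD(n); }  // E <- C, D
```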

It's (way) trickier to validate that these duplicated computations are the (sole?) root cause for MOBIL cars' degraded performance though.

Doing some very coarse caching in the PoseAggregator, as you suggest, reduces CPU load but makes it hungrier for RAM (always in terms of total allocated memory; I presume because of the extra copy of the bundle, since lots of heap-allocated Eigen matrices are being moved around). To put it in numbers, compare the total reduction in callee call counts. Also, having 60 MOBIL cars distributed on a 4-lane dragway makes /visualizer/scene_updates topic messages go out at ~20 Hz (realtime) without the change and at ~40 Hz (realtime) with the change.

pose_aggregtor_callees_with_caching

Still, cache support isn't reaching each MOBIL car (sub)diagram (and those are quite convoluted, so I'd imagine it suffers from this as well).

Finally, taking a look below automotive, I would not rule out that parts of the implementation may not scale that well as more cars get added to the world (e.g. see the PoseSelector system implementation). But that's just a hunch; I have not profiled nor studied the complexity of each component individually.

basicNew commented 6 years ago

Yes, my point here was to do some very rough approximation to see if, by caching, we were getting any of the expected benefits; doubling the update frequency points in a promising direction. So, how do you suggest we move forward with this?

hidmic commented 6 years ago

@basicNew What if we reach out to the author of https://github.com/RobotLocomotion/drake/pull/7668 to see how we can give this new caching support a shot (locally I mean)? If it works[*], I wouldn't invest more time on this (at least for M3) and I'd push/wait for the feature to get released.

If it doesn't work (i.e. there're more bottlenecks, but in the systems themselves), we'll have to benchmark each of the (relevant) automotive systems in isolation. I'm sure there are aspects of the implementation and/or the algorithms that could use an improvement. It's a bigger task though.

[*] BTW, I see that we have goal of at least 20 agents @ 100 Hz, but I think we're lacking a target host. What kind of hardware are we expecting Delphyne users to have?

stonier commented 6 years ago

Avidly watching these performance threads. I'd certainly like to learn how exactly Drake's systems framework does its computation. Can you jot down notes as you discover things, @hidmic? I've found precious little other documentation (design notes or user documentation) around, and have resorted to just learning by discovery over the last couple of weeks.

Realistically, I think we should be able to nail that target on a decent laptop (for the backend at least). It really shouldn't take a thrashing machine; we used to run a far more intensive vision SLAM on a very low-spec'd ARM core.

If we're going to seriously work on the systems, might also be time to just port them all in from drake::automotive so we can work on them rapidly here. I started doing that for the rail follower in branch stonier/rail_follower.

Can work out what to do with them later once we've done surgery on them.

basicNew commented 6 years ago

@hidmic I think reaching out to Sherm would be a good thing to do, both to understand ETAs and to let him know we are also interested in these improvements. Also, it would be great if we could get a branch to start benchmarking MOBIL cars with the system cache on.

Another place where we could start digging is maliput queries (we talked about this a couple of weeks ago). I think the dragway implementation should be pretty straightforward, so I'm not expecting much penalty there, but it's definitely worth checking (multilane will be a different story, but we can deal with that once we get there).

Finally, I like the idea of looking inside the MOBIL car to understand what the major bottlenecks are; e.g. a MOBIL car uses a MobilPlanner, an IdmController, a PurePursuitController, etc. We could start by comparing them and checking if something stands out.

hidmic commented 6 years ago

@basicNew @stonier I spoke to Michael Sherman, the author of https://github.com/RobotLocomotion/drake/pull/7668 (tracked by https://github.com/RobotLocomotion/drake/issues/4364). The ETA was (roughly) a couple of weeks for the entire caching support to be merged into master. The latest related PR is https://github.com/RobotLocomotion/drake/pull/9067. He also gave me the green light to pull his feature branch and test, which I've done.

Again, having 60 MOBIL cars distributed on a 4-lane dragway makes /visualizer/scene_updates topic messages go out at ~20 Hz (realtime) with caching disabled and at ~60 Hz (realtime) with caching enabled.

Now most of the time seems to be spent within MobilPlanner::CalcLaneDirection, specifically when it makes calls to the PoseSelector class.

calc_lane_direction_profile

I'll see if I can improve it or at least pinpoint the main issue.

basicNew commented 6 years ago

Great, thanks for the heads up @hidmic ! 3x with caching and a single method consuming that much is a very promising start! It would be great to see what we can change in CalcLaneDirection to further improve things.

hidmic commented 6 years ago

@basicNew @stonier Ok, confirmed. PoseSelector's implementation is another major bottleneck. I did a quick local hack of the main loop inside PoseSelector::FindSingleClosestInDefaultPath(). Namely, traffic_isometry can be a reference, and inlining the IsWithinLane() code in the loop removes an extra ToLanePosition() call and the need to repeatedly query the RoadGeometry linear tolerance.
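The gist of that kind of hack, with hypothetical names rather than the actual maliput/drake API: take loop items by reference, hoist the loop-invariant tolerance query out of the loop, and fold the lane check into the loop body.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch (hypothetical types, not the real API) of the
// micro-optimizations described above.
struct Pose { double s; };  // longitudinal position along the lane

struct Road {
  mutable int tolerance_queries = 0;
  double linear_tolerance() const {  // counts how often it is queried
    ++tolerance_queries;
    return 0.01;
  }
};

// Returns the gap to the closest traffic car ahead of the ego car.
double ClosestAhead(const Road& road, double ego_s,
                    const std::vector<Pose>& traffic) {
  const double tol = road.linear_tolerance();  // hoisted: queried once
  double best = 1e9;
  for (const Pose& pose : traffic) {           // by reference: no copies
    const double gap = pose.s - ego_s;
    if (gap > tol && gap < best) best = gap;   // inlined "is ahead" check
  }
  return best;
}
```

Same result, but the invariant query runs once per scan instead of once per traffic car.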

With just that, we get messages at ~75 Hz with the same 60 MOBIL cars in the same 4-lane dragway. Looking through the code, there are many places where queries are duplicated for the sake of code clarity and locality.

Incidentally, while trying to speed up valgrind, I replaced all Prius meshes with boxes. Message frequency got close to ~90 Hz. The CarVisApplicator may be another (medium-sized) player.

It remains to be seen how this winds up, but evidence seems to suggest that we are pushing the limits of automotive's implementation (if not Drake's). In the short-term, we can do some surgery to PoseSelector's implementation. But moving forward, I think it'd be helpful to have a clear goal to know how deep we're going in our optimization crusade :).

sherm1 commented 6 years ago

Glad to see caching was helpful! Thanks for the beautifully-executed experiment, @hidmic.

hidmic commented 6 years ago

@basicNew @stonier do we have green light to start PoseSelector refactor locally? Or do you guys want to wait for official caching support? (BTW thank you @sherm1 for the feature!)

basicNew commented 6 years ago

I would vote for starting the PoseSelector refactor now.

hidmic commented 6 years ago

@basicNew @stonier Update! I spent quite some time understanding the PoseSelector and trying to reduce its overhead (rearranging code, reducing repeated calls, etc.). But so far I've failed to go beyond that last ~75 Hz metric I reported (with caching enabled on the Drake's side of course). Here's the branch with the latest additions below our systems directory. The previous branch with C++ demos to profile was rebased on top.

The problem lies in the fact that both the MobilPlanner and IdmController systems make multiple calls to PoseSelector functions (static methods, to be more accurate), which in turn perform exhaustive searches (e.g. every car goes through all other cars checking which one is ahead of it, on every single step).

So to get that final boost, we need whole new algorithms. We can leverage knowledge of the other cars' pose, velocity and acceleration to skip checks for the ones that won't be close enough soon (assuming cars don't teleport, which might break on e.g. dragways, which do have such a device).

Let me know if we have the green light to start researching the topic.

basicNew commented 6 years ago

Hmm, interesting. Some time ago the idea of a "swarm" controller (as opposed to single-car controllers) was brought up by @stonier as a way to be more efficient. Maybe we can do a time-bounded thing (2 days tops) to do some research in this direction, using this particular computation as a concrete example of performance optimization?

On the other hand, with current optimizations I think we are ok for the demos we want to run, so I would hate to keep investing time on something that is not required for this milestone.

What do you think @stonier ?

stonier commented 6 years ago

Sorry for the delay, missed the notification from the first question last week.

Matt/Jon had mentioned something like that which you describe, but neither had done any digging. What are the exhaustive searches that are done in turn?

@basicNew it wasn't exactly a swarm controller, though that is an option. I was thinking about simulator-level systems to expedite calculations on data from the other agent systems + pose aggregator. Does this make sense here? Or would it be redundant if there were no MOBIL cars?

With regards to decision making, I think we should go forward with pulling the agent systems/states/params and pose selector into delphyne (in the spirit of #498 and PR 499). We are going to have to dissect those pretty hard anyway. Once they're in, we can at least put proposals in a doc - that would be good to have while it's fresh in everybody's mind, regardless of whether we do them for M3 or not.

hidmic commented 6 years ago

@stonier Sorry too, I did not give you guys any pointers to the actual code. I'll use our version of these systems, as I've removed silly duplications and weird data structures (e.g. LaneDirection) there already.

The MobilPlanner system queries the current and side lanes (if any). On each query, it scans behind and ahead of the ego car. For each scan direction, it follows the default lane path (using default ongoing branches or the first one found, see here), looking for the closest traffic car in that path, essentially querying them all. That happens on every simulation step.

The IdmController system only scans ahead in the current lane, but it currently looks for branches too, so from each traffic car position (which is in turn found by a lookup throughout the entire road geometry) it traverses the road back to the ego car to find the closest one. That happens on every simulation step.

So far, running a dragway-based demo on top of valgrind shows that PoseSelector calls from the MobilPlanner account for roughly 30% of the CPU load. Methods like Lane::ToLanePositionT, which are rather inexpensive, are called millions of times and thus become relevant. I can only imagine how hard the hit would be on a more complex road (valgrind's performance penalty is so large that it takes ages just to load a multilane road mesh).
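A back-of-envelope count (illustrative helper, not existing code) shows how fast those calls stack up when every car scans every other car on every step:

```cpp
#include <cassert>

// Per-run road queries grow quadratically in the number of cars, so
// even a cheap call like Lane::ToLanePositionT ends up invoked
// millions of times over a run.
long long QueriesPerRun(int num_cars, long long num_steps) {
  // Each car checks every other car once per step.
  return static_cast<long long>(num_cars) * (num_cars - 1) * num_steps;
}
```

With the 60 cars and 5000 steps used in the experiments above, that's 60 × 59 × 5000 ≈ 17.7 million queries per run.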

hidmic commented 6 years ago

@basicNew @stonier Using a swarm controller makes sense only for MOBIL cars right now. We'd still need to figure out how to optimize away redundant computations in that case.

I spent some time thinking on how to overcome these issues in a (relatively) simple way. If we could ban teleportation (modifying dragway implementation for starters), and we had knowledge of maximum velocities and accelerations, then we could narrow the search domain. At a (very) high level, we could:

  1. Skip traffic cars whose euclidean distance in the global frame w.r.t. the ego car is larger than the specified scan distance. Defer next check to the time it'd take the traffic car to enter the scan distance radius based on a worst case estimate of how the traffic car and the ego car are moving (e.g. something rough like (traffic_distance - scan_distance) / (2 * top_speed), to account for the relative speed between the two cars, or using headings, whatever works best).
  2. Shortest path search from each remaining traffic car to the ego car using the road. Still haven't figured out how to model lane changes in that graph.
  3. Skip traffic cars whose graph (or lane) distance w.r.t. the ego car is larger than the specified scan distance. Defer the next check to the time it'd take the traffic car to get within scan distance, based on a worst-case estimate of how the traffic car and the ego car are moving.
  4. Adaptively scale the scan distance based on proximity to other cars. If traffic is sparse (like on an empty road), increase the scan distance; if traffic is dense (like in a jam), decrease it.

And only then perform the kind of lookups that we're using now (reusing as many computations from previous steps as possible). PoseSelector will no longer be stateless.
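Step 1 above could be sketched roughly like this (hypothetical helper; the deferral formula is the rough one suggested earlier, with a head-on worst case at a shared top speed):

```cpp
#include <cassert>
#include <cmath>

// Rough sketch of step 1 (not existing code): skip a traffic car that
// is beyond the scan distance, and compute the earliest time it could
// possibly re-enter the scan radius, assuming a shared top speed bound
// and no teleportation.
struct PruneResult {
  bool skip;             // true if the car can be ignored this step
  double recheck_after;  // seconds until the next mandatory check
};

PruneResult PruneByDistance(double traffic_distance, double scan_distance,
                            double top_speed) {
  if (traffic_distance <= scan_distance) return {false, 0.0};
  // Worst case: both cars close the gap head-on at top speed.
  return {true, (traffic_distance - scan_distance) / (2.0 * top_speed)};
}
```

Between rechecks, that traffic car drops out of the per-step exhaustive search entirely, which is where the savings come from.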

hidmic commented 6 years ago

But as you can see, it's a nontrivial task.

+1 to pulling in these classes. The clean ones from Drake or the modified ones in this branch? I can start a document with the brain dump above too.

hidmic commented 6 years ago

Design doc opened.

stonier commented 6 years ago

I added some notes in the doc. Maybe a chat on hangout would expedite things?

As for moving over the systems, it's hard to see what you changed in that branch since it's all in one commit (i.e. all new). When bringing them in, just bring them in raw in one commit, then make changes in additional commits, and don't squash the end result so we can trace what we've been doing. Also, for RailFollower, take what's in delphyne now and add the changes on top of that (i.e. LaneDirection etc.).

We can probably do this incrementally over a few PRs. I was actually going to bring in one of the systems tonight till you distracted me with all the updates :)

hidmic commented 6 years ago

@stonier Thanks! Replied to all your comments. We can chat whenever you want/can.

Noted. I'll open a PR that brings the classes fresh with no changes in the first commit and then applies all changes in a second commit.

hidmic commented 6 years ago

Alright, #501 opened. Also, #503 and ToyotaResearchInstitute/delphyne-gui#140 introduce a small demo that, by default, puts 20 MOBIL cars on common roads. The idea is to use it from now on to check where we are in terms of performance and also to get a grasp of the impact hardware variations have (at least among us Delphyne devs). If you guys can run it and report stats back, that'd be great.

Also, based on a brief discussion I had w/ @basicNew, I went on to try the gazoo demo too, with the same 60 MOBIL cars, though not on top of valgrind (as I said before, the performance penalty is insane, even at the road geometry mesh generation step). Results were less encouraging, though not entirely unexpected:

Computation Policy    Build Type    Scene Update Frequency
prefer-accuracy       debug         ~7 Hz
prefer-accuracy       release       ~10 Hz
prefer-speed          debug         ~14 Hz
prefer-speed          release       ~20 Hz

Caching was enabled on Drake's side. This reinforces the fact that Multilane is slow when always accurate (FYI, solutions for this are in the works).

hidmic commented 6 years ago

@basicNew @stonier given that we know that:

  1. Having Drake's caching support enabled is a must.
  2. Having Multilane's optimized computations is a must.
  3. Reducing MOBIL cars' algorithmic complexity is the next step forward, starting at the PoseSelector class (and we have a design doc to track this).
  4. We have a few demos to check now.

I'd say we can close this issue. We can open up another issue to track MOBIL car system improvements. What do you think?

basicNew commented 6 years ago

I would prefer to keep this one open until we close all the other issues.

About the perf demo (+100 by the way). What are your stats on that? (i.e. dragway vs linear multilane vs arc multilane)

sherm1 commented 6 years ago

Heads-up re caching: Drake PR #9113 is in review. When it lands (likely next week), Drake master will have caching but it will be disabled by default. context.EnableCaching() turns it on. A subsequent PR will change the default.

hidmic commented 6 years ago

That's awesome @sherm1! Thank you for the FYI.

hidmic commented 6 years ago

@basicNew sorry for the delay to reply, I had to rebuild everything for Release from scratch, using Drake caching plus the latest MOBIL classes refactor. My stats are shown below:

For a dragway.

michel@abydos:~/Workspaces/delphyne_ws$ delphyne-mobil-perf dragway
Running simulation for 5.0 seconds.
Simulation ended. I'm happy, you should be too.
= Simulation stats ==========================
  Simulation runs: 1
  Simulation steps: 5000
  Elapsed simulation time: 5.0s
  Elapsed real time: 7.580936824s
=============================================

For a straight Multilane road.

michel@abydos:~/Workspaces/delphyne_ws$ delphyne-mobil-perf straight_lanes
Running simulation for 5.0 seconds.
Simulation ended. I'm happy, you should be too.
= Simulation stats ==========================
  Simulation runs: 1
  Simulation steps: 5000
  Elapsed simulation time: 5.0s
  Elapsed real time: 12.632072546s
=============================================

For a curved (arc) Multilane road.

michel@abydos:~/Workspaces/delphyne_ws$ delphyne-mobil-perf curved_lanes
Running simulation for 5.0 seconds.
Simulation ended. I'm happy, you should be too.
= Simulation stats ==========================
  Simulation runs: 1
  Simulation steps: 5000
  Elapsed simulation time: 5.0s
  Elapsed real time: 17.910087965s
=============================================
basicNew commented 6 years ago

Great, thanks for the data @hidmic ! I would love to see those numbers again once we land the performance improvements on multilane (in particular how straight_lanes compares to dragway)

sherm1 commented 6 years ago

@hidmic -- RobotLocomotion/drake#9113 landed so caching is in Drake master now. As mentioned above it is off by default so you have to write context.EnableCaching() to turn it on.

stonier commented 6 years ago

Further improvements on the MOBIL car are on hold. Anything we do right now will only be patching; what we really need to consider is a better ado car management architecture.

hidmic commented 3 years ago

@agalbachicar I'll un-assign myself for now. Pull me back in if need be!