rst-tu-dortmund / teb_local_planner

An optimal trajectory planner considering distinctive topologies for mobile robots based on Timed-Elastic-Bands (ROS Package)
http://wiki.ros.org/teb_local_planner
BSD 3-Clause "New" or "Revised" License
1.06k stars 550 forks source link

precompute sin/cos to improve performance #341

Open VRichardJP opened 2 years ago

VRichardJP commented 2 years ago

As discussed in https://github.com/rst-tu-dortmund/teb_local_planner/issues/340

VRichardJP commented 2 years ago

@corot I don't include cmdvel in my benchmark so I overlooked it. I guess the issue here is that when PoseSE2 is copied sin/cos gets calculated over and over instead of being copied. For example with the current code, something like:

pose1.theta() = pose2.theta();

is equivalent to:

pose1.theta()._theta = pose2.theta()._theta;
pose1.theta().update_sincos();

Maybe adding a simple copy constructor would fix the performance in your benchmark:

Theta(const Theta& other) {
    _rad = other._rad;
    _sin = other._sin;
    _cos = other._cos;
  }

That being said, I don't really like to keep implicit Theta->double and double->Theta conversion as Theta comes with an upfront cost. I am considering 2 ways to tackle this:

  1. Keep implicit conversion but make sin/cos evaluation lazy. This is really easy to implement and as little impact on the rest of the code but it comes with a small runtime cost. For example:
    const double& sin() {
    if (!_init) {
        update_sincos();
        _init = true;
    }
    return _sin;
    }
  2. Make conversion explicit both ways and modify the rest of teb code accordingly. There is no runtime cost but it adds complexity in the code. Eventually, there are cases where sin/cos values are not necessary and could be left undefined, but that could be the source of bugs in the future.

What do you think?

VRichardJP commented 2 years ago

I have tested both approach 1 and 2 in my setup. The difference in performance is negligible on my machine: the implementation 1 is ~1% slower compared to the implementation 2, while being way simpler and less prone to bugs.

EDIT: I observe with google perftools that with the lazy sincos version, the total number of sincos call is reduced by 2/3 compared to original branch. This bears out my initial observation that sin/cos computation is largely duplicated.

VRichardJP commented 2 years ago

Hi, I'm a bit confused to see it does not improve things in your side =/

To benchmark the changes I am using a private fork of melodic-devel, but besides a few extra options or weights the functionality/performance should be the same. I am measuring performance using test_optim_node: I start the node, load a scenario (custom set of obstacle and start/stop pose), do N loops of CB_mainCycle then stop. I have more than 200 scenarios. Although the time I measure includes some ros setup and initialization, with my PR I get a clean +10% performance gain on the whole benchmark, which means the performance on the plan() function itself is even bigger.

For example, I have generated a call graph with google-perftools of one of the scenario without using my changes (RelWithDebInfo mode with SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,--no-as-needed,-lprofiler,--as-needed"), ran with CPUPROFILE_FREQUENCY=1000 CPUPROFILE=/tmp/teb.profile roslaunch <...> and visualized with google-pprof --web install/teb_local_planner/lib/teb_local_planner/test_optim_node /tmp/teb.profile): image

You can see at the bottom of the picture that test_optim_node spends 34% of its time in _sincos.

This is the same graph but with the PR: image

The time spent in _sincos has dropped to 12% because eval_sincos catches many duplicates. For some reason you can see that TwoCirclesRobotFootprint::calculateDistance still manage to call _sincos directly and does not go through eval_sincos. I am wondering why it happens, maybe it is an after effect of the compiler optimization.

From your performance I understand that my PR has no effect on your machine. I see 2 possible causes:

corot commented 2 years ago

Hi, I'm a bit confused to see it does not improve things in your side =/

me too! I tried to rule out some possibilities:

What's that test_optim_node? I can give it a try. Could you also test just recording execution time for computeCmdVel calls? I use this simple code:

time.diff.zip

VRichardJP commented 2 years ago

This is the callback I use for benchmarking purpose: https://github.com/rst-tu-dortmund/teb_local_planner/blob/ec5759d0bf982cd2266715bc06a3385309ec3eb1/src/test_optim_node.cpp#L165-L170

RainerKuemmerle commented 2 years ago

Some edges compute their Jacobian numerically. This might also explain why you observe a lot of calls to sin/cos in general. For those edges, caching does not help when evaluating wrt theta but for x/y. For edges with analytic Jacobian, the overhead of caching might lead to negative impact. Maybe benchmark per Edge on computeError/linearizeOplus leads to interestings insights.