VRichardJP opened 2 years ago
@corot I don't include cmdvel in my benchmark, so I overlooked it. I guess the issue here is that when a PoseSE2 is copied, sin/cos get recalculated over and over instead of being copied. For example, with the current code, something like:
pose1.theta() = pose2.theta();
is equivalent to:
pose1.theta()._theta = pose2.theta()._theta;
pose1.theta().update_sincos();
Maybe adding a simple copy constructor would fix the performance in your benchmark:
Theta(const Theta& other) {
  // copy the cached values instead of recomputing sin/cos
  _rad = other._rad;
  _sin = other._sin;
  _cos = other._cos;
}
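Since pose1.theta() = pose2.theta() goes through copy assignment rather than copy construction, the same idea would presumably also be needed in the assignment operator. A minimal sketch, assuming the same _rad/_sin/_cos members (not code from the PR):

Theta& operator=(const Theta& other) {
  // copy the cached values instead of recomputing sin/cos on assignment
  _rad = other._rad;
  _sin = other._sin;
  _cos = other._cos;
  return *this;
}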
That being said, I don't really like keeping the implicit Theta->double and double->Theta conversions, as Theta comes with an upfront cost. I am considering 2 ways to tackle this; the snippet below illustrates the lazy variant, where sin/cos are only computed on first access:
// lazily compute sin/cos the first time they are needed
const double& sin() {
  if (!_init) {
    update_sincos();
    _init = true;
  }
  return _sin;
}
What do you think?
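For reference, here is a self-contained sketch of what the lazy variant could look like as a whole class; the member names, the mutable cache and the ensure() helper are my assumptions for illustration, not the actual PR code:

#include <cmath>

class Theta {
public:
  explicit Theta(double rad = 0.0) : _rad(rad) {}

  // Changing the angle only marks the cache dirty; no trig is evaluated here.
  void set(double rad) { _rad = rad; _init = false; }
  double rad() const { return _rad; }

  // sin/cos are computed at most once per angle change, on first access.
  double sin() const { ensure(); return _sin; }
  double cos() const { ensure(); return _cos; }

private:
  void ensure() const {
    if (!_init) {
      _sin = std::sin(_rad);
      _cos = std::cos(_rad);
      _init = true;
    }
  }

  double _rad;
  mutable double _sin = 0.0;
  mutable double _cos = 0.0;
  mutable bool _init = false;
};

With this layout, copies are cheap (the cached values travel with the object) and the trigonometry is only paid when sin()/cos() are actually read.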
I have tested both approaches 1 and 2 in my setup. The difference in performance is negligible on my machine: implementation 1 is ~1% slower than implementation 2, while being much simpler and less prone to bugs.
EDIT: I observe with google-perftools that with the lazy sincos version, the total number of sincos calls is reduced by 2/3 compared to the original branch. This bears out my initial observation that sin/cos computation is largely duplicated.
Hi, I'm a bit confused to see it does not improve things on your side =/
To benchmark the changes I am using a private fork of melodic-devel, but besides a few extra options and weights the functionality/performance should be the same. I am measuring performance using test_optim_node: I start the node, load a scenario (a custom set of obstacles and start/stop poses), run N loops of CB_mainCycle, then stop. I have more than 200 scenarios. Although the time I measure includes some ROS setup and initialization, with my PR I get a clean +10% performance gain on the whole benchmark, which means the gain on the plan() function itself is even bigger.
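A rough sketch of this kind of timing loop (not the actual benchmark code; it assumes the CB_mainCycle(const ros::TimerEvent&) callback of test_optim_node and a scenario that has already been loaded):

#include <chrono>
#include <ros/ros.h>

void benchmarkScenario(int n_loops)
{
  const auto t_start = std::chrono::steady_clock::now();
  for (int i = 0; i < n_loops; ++i)
    CB_mainCycle(ros::TimerEvent());  // run the planning callback N times
  const auto t_end = std::chrono::steady_clock::now();
  const double ms = std::chrono::duration<double, std::milli>(t_end - t_start).count();
  ROS_INFO("%d planning cycles took %.1f ms", n_loops, ms);
}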
For example, I have generated a call graph with google-perftools for one of the scenarios without my changes (built in RelWithDebInfo mode with SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,--no-as-needed,-lprofiler,--as-needed"), run with CPUPROFILE_FREQUENCY=1000 CPUPROFILE=/tmp/teb.profile roslaunch <...>, and visualized with google-pprof --web install/teb_local_planner/lib/teb_local_planner/test_optim_node /tmp/teb.profile):
You can see at the bottom of the picture that test_optim_node spends 34% of its time in _sincos.
This is the same graph but with the PR:
The time spent in _sincos has dropped to 12% because eval_sincos catches many duplicates. For some reason, TwoCirclesRobotFootprint::calculateDistance still manages to call _sincos directly instead of going through eval_sincos. I am wondering why that happens; maybe it is a side effect of compiler optimization.
From your performance numbers I understand that my PR has no effect on your machine. I see 2 possible causes:

1. The _sincos calls are replaced by a table lookup after optimization. In that case, my PR is absolutely useless and just adds a useless if and extra data to copy. What CPU/compiler are you using? On my side, I have an Intel i7-9700 @ 3.00GHz and use the gcc 7.5.0 compiler.
2. The two_circles footprint. From my understanding of teb, the world/vehicle/tuning used only influences the number of sin/cos calls, which are duplicated in all cases, so it should not be a thing.

> Hi, I'm a bit confused to see it does not improve things on your side =/
me too! I tried to rule out some possibilities:
What's that test_optim_node? I can give it a try.
Could you also test just recording execution time for computeCmdVel calls? I use this simple code:
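The snippet referred to above is not included in the thread; for illustration only, such a per-call timing could look roughly like the following, assuming the melodic-era interface bool computeVelocityCommands(geometry_msgs::Twist&):

#include <chrono>
#include <geometry_msgs/Twist.h>

// Illustrative sketch only: time a single computeVelocityCommands() call on an
// already configured planner instance.
template <typename Planner>
double timeComputeVelocityCommands(Planner& planner)
{
  geometry_msgs::Twist cmd_vel;
  const auto t0 = std::chrono::steady_clock::now();
  planner.computeVelocityCommands(cmd_vel);  // the call being measured
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();  // milliseconds
}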
Regarding the footprint: the main difference is with two_circles in the distance to obstacle calculation, since with that footprint it is necessary to compute some sin/cos to evaluate the position of each circle center point. That said, there are still many other places that heavily use sin/cos and should theoretically benefit from this PR.

test_optim_node is this node: https://github.com/rst-tu-dortmund/teb_local_planner/blob/melodic-devel/src/test_optim_node.cpp

This is the callback I use for benchmarking purposes: https://github.com/rst-tu-dortmund/teb_local_planner/blob/ec5759d0bf982cd2266715bc06a3385309ec3eb1/src/test_optim_node.cpp#L165-L170
Some edges compute their Jacobian numerically. This might also explain why you observe a lot of calls to sin/cos in general. For those edges, caching does not help when evaluating the derivative with respect to theta, but it does for x/y. For edges with an analytic Jacobian, the overhead of caching might even have a negative impact. Maybe a per-edge benchmark of computeError/linearizeOplus would lead to interesting insights.
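A minimal sketch of such a per-edge measurement, assuming the g2o-style public computeError()/linearizeOplus() interface and an edge whose vertices are already attached (accumulation per edge type is left out):

#include <chrono>

// Returns the combined time, in microseconds, of one computeError() plus
// linearizeOplus() evaluation for a single edge.
template <typename Edge>
double timeEdgeEvaluation(Edge& edge)
{
  const auto t0 = std::chrono::steady_clock::now();
  edge.computeError();     // error term; may call sin/cos
  edge.linearizeOplus();   // Jacobian: numeric for some TEB edges, analytic for others
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(t1 - t0).count();
}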
As discussed in https://github.com/rst-tu-dortmund/teb_local_planner/issues/340