VRichardJP opened 2 years ago
@corot I don't include cmdvel in my benchmark, so I overlooked it. I guess the issue here is that when a PoseSE2 is copied, sin/cos get recalculated over and over instead of being copied. For example, with the current code, something like:
pose1.theta() = pose2.theta();
is equivalent to:
pose1.theta()._theta = pose2.theta()._theta;
pose1.theta().update_sincos();
Maybe adding a simple copy constructor would fix the performance in your benchmark:
Theta(const Theta& other) {
  // copy the cached values instead of recomputing sin/cos
  _rad = other._rad;
  _sin = other._sin;
  _cos = other._cos;
}
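Since pose1.theta() = pose2.theta() goes through copy assignment rather than copy construction, the same idea would presumably also be needed in the assignment operator. A minimal sketch, assuming the same _rad/_sin/_cos members (not code from the PR):

Theta& operator=(const Theta& other) {
  // copy the cached values instead of recomputing sin/cos on assignment
  _rad = other._rad;
  _sin = other._sin;
  _cos = other._cos;
  return *this;
}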
That being said, I don't really like keeping the implicit Theta->double and double->Theta conversions, as Theta comes with an upfront cost. I am considering 2 ways to tackle this; the snippet below illustrates the lazy variant, where sin/cos are only computed on first access:
// lazily compute sin/cos the first time they are needed
const double& sin() {
  if (!_init) {
    update_sincos();
    _init = true;
  }
  return _sin;
}
What do you think?
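For reference, here is a self-contained sketch of what the lazy variant could look like as a whole class; the member names, the mutable cache and the ensure() helper are my assumptions for illustration, not the actual PR code:

#include <cmath>

class Theta {
public:
  explicit Theta(double rad = 0.0) : _rad(rad) {}

  // Changing the angle only marks the cache dirty; no trig is evaluated here.
  void set(double rad) { _rad = rad; _init = false; }
  double rad() const { return _rad; }

  // sin/cos are computed at most once per angle change, on first access.
  double sin() const { ensure(); return _sin; }
  double cos() const { ensure(); return _cos; }

private:
  void ensure() const {
    if (!_init) {
      _sin = std::sin(_rad);
      _cos = std::cos(_rad);
      _init = true;
    }
  }

  double _rad;
  mutable double _sin = 0.0;
  mutable double _cos = 0.0;
  mutable bool _init = false;
};

With this layout, copies are cheap (the cached values travel with the object) and the trigonometry is only paid when sin()/cos() are actually read.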
I have tested both approaches 1 and 2 in my setup. The difference in performance is negligible on my machine: implementation 1 is ~1% slower than implementation 2, while being much simpler and less prone to bugs.
EDIT: I observe with google-perftools that with the lazy sincos version, the total number of sincos calls is reduced by 2/3 compared to the original branch. This bears out my initial observation that sin/cos computation is largely duplicated.
Hi, I'm a bit confused to see it does not improve things on your side =/
To benchmark the changes I am using a private fork of melodic-devel, but besides a few extra options and weights the functionality/performance should be the same. I am measuring performance using test_optim_node: I start the node, load a scenario (a custom set of obstacles and start/stop poses), run N loops of CB_mainCycle, then stop. I have more than 200 scenarios. Although the time I measure includes some ROS setup and initialization, with my PR I get a clean +10% performance gain on the whole benchmark, which means the gain on the plan() function itself is even bigger.
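A rough sketch of this kind of timing loop (not the actual benchmark code; it assumes the CB_mainCycle(const ros::TimerEvent&) callback of test_optim_node and a scenario that has already been loaded):

#include <chrono>
#include <ros/ros.h>

void benchmarkScenario(int n_loops)
{
  const auto t_start = std::chrono::steady_clock::now();
  for (int i = 0; i < n_loops; ++i)
    CB_mainCycle(ros::TimerEvent());  // run the planning callback N times
  const auto t_end = std::chrono::steady_clock::now();
  const double ms = std::chrono::duration<double, std::milli>(t_end - t_start).count();
  ROS_INFO("%d planning cycles took %.1f ms", n_loops, ms);
}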
For example, I have generated a call graph with google-perftools for one of the scenarios without my changes (built in RelWithDebInfo mode with SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,--no-as-needed,-lprofiler,--as-needed"), run with CPUPROFILE_FREQUENCY=1000 CPUPROFILE=/tmp/teb.profile roslaunch <...>, and visualized with google-pprof --web install/teb_local_planner/lib/teb_local_planner/test_optim_node /tmp/teb.profile):
You can see at the bottom of the picture that test_optim_node spends 34% of its time in _sincos.
This is the same graph but with the PR:
The time spent in _sincos has dropped to 12% because eval_sincos catches many duplicates. For some reason, TwoCirclesRobotFootprint::calculateDistance still manages to call _sincos directly instead of going through eval_sincos. I am wondering why that happens; maybe it is a side effect of compiler optimization.
From your performance numbers I understand that my PR has no effect on your machine. I see 2 possible causes:

1. The _sincos calls are replaced by a table lookup after optimization. In that case, my PR is absolutely useless and just adds a useless if and extra data to copy. What CPU/compiler are you using? On my side, I have an Intel i7-9700 @ 3.00GHz and use the gcc 7.5.0 compiler.
2. The two_circles footprint. From my understanding of teb, the world/vehicle/tuning used only influences the number of sin/cos calls, which are duplicated in all cases, so it should not be a thing.

> Hi, I'm a bit confused to see it does not improve things on your side =/
me too! I tried to rule out some possibilities:
What's that test_optim_node? I can give it a try.
Could you also test just recording execution time for computeCmdVel calls? I use this simple code:
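The snippet referred to above is not included in the thread; for illustration only, such a per-call timing could look roughly like the following, assuming the melodic-era interface bool computeVelocityCommands(geometry_msgs::Twist&):

#include <chrono>
#include <geometry_msgs/Twist.h>

// Illustrative sketch only: time a single computeVelocityCommands() call on an
// already configured planner instance.
template <typename Planner>
double timeComputeVelocityCommands(Planner& planner)
{
  geometry_msgs::Twist cmd_vel;
  const auto t0 = std::chrono::steady_clock::now();
  planner.computeVelocityCommands(cmd_vel);  // the call being measured
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();  // milliseconds
}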
Regarding the footprint: the main difference is with two_circles in the distance to obstacle calculation, since with that footprint it is necessary to compute some sin/cos to evaluate the position of each circle center point. That said, there are still many other places that heavily use sin/cos and should theoretically benefit from this PR.

test_optim_node is this node: https://github.com/rst-tu-dortmund/teb_local_planner/blob/melodic-devel/src/test_optim_node.cpp

This is the callback I use for benchmarking purposes: https://github.com/rst-tu-dortmund/teb_local_planner/blob/ec5759d0bf982cd2266715bc06a3385309ec3eb1/src/test_optim_node.cpp#L165-L170
Some edges compute their Jacobian numerically. This might also explain why you observe a lot of calls to sin/cos in general. For those edges, caching does not help when evaluating the derivative with respect to theta, but it does for x/y. For edges with an analytic Jacobian, the overhead of caching might even have a negative impact. Maybe a per-edge benchmark of computeError/linearizeOplus would lead to interesting insights.
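A minimal sketch of such a per-edge measurement, assuming the g2o-style public computeError()/linearizeOplus() interface and an edge whose vertices are already attached (accumulation per edge type is left out):

#include <chrono>

// Returns the combined time, in microseconds, of one computeError() plus
// linearizeOplus() evaluation for a single edge.
template <typename Edge>
double timeEdgeEvaluation(Edge& edge)
{
  const auto t0 = std::chrono::steady_clock::now();
  edge.computeError();     // error term; may call sin/cos
  edge.linearizeOplus();   // Jacobian: numeric for some TEB edges, analytic for others
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(t1 - t0).count();
}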
As discussed in https://github.com/rst-tu-dortmund/teb_local_planner/issues/340