Closed traversaro closed 8 years ago
A preliminary profiling with the valgrind tool callgrind shows that 20% of the instructions is used by Joint::GetTransform()
(confirming the intuition discussed in the previous comment) and 10% of the instructions is used by the empty constructor of Vector3
(because its content is initialized to 0):
Note that 10% of instructions does not account to 10% of the execution time: on modern architectures most of the execution time is lost on cache misses and branch misprediction. To profile this aspects you can use the advanced functionality of cachegrind
, and collect profiling data on the actual execution of the program using perf (on Linux).
:+1: excellent test!
Another big bottleneck is apparently all the virtual destructor of "basic" classes such as VectorFixSize
, Position
, Transform
, SpatialVector
, etc etc .
All this classes where explicitly designed such that they could be easily used as local variable to a function (and returned by value/copied) with minimal overhead. However we forgot something:
In this screenshot you can see that the iDynTree::AngularForceVector3::~AngularForceVector3()
accounts for the incredible number of ~126 millions of instructions, while (as a comparison) the CRBA (i.e. calculation of the mass matrix) with the old implementation alone is just ~200 millions of instructions (this are the numbers for a big number of runs of the algorithms). The RNEA with the old implementation instruction count is so low that is not part of the screenshot at all (the function are sorted by the instruction count!).
The big problem is that the iDynTree::AngularForceVector3::~AngularForceVector3()
destructors is not doing anything at all! The problem is that the compiler is calling the virtual destructor of all its base class. Even the iDynTree::IVector::~IVector()
destructor accounts for ~50 Millions instructions!
The solution to this is to transform all the destructors of interfaces that are not meant for polymorphism (so all the interfaces, except IJoint
and ISensor
) to be protected and not-virtual (as opposed to be public and virtual. [1] [2]
We need then to provide non virtual destructors that do nothing (the default destructor should be ok) to all classes that are meant to be used efficiently as local variables in function (the one that we listed before). We should also be sure that this destructors are properly inline by the compiler, such that the destrunction of a class of those types have zero overhead.
cc @francesco-romano
[1] : http://en.cppreference.com/w/cpp/language/destructor (See "Virtual destructors" section) [2] : http://www.gotw.ca/publications/mill18.htm (see "Guideline 4).
I discussed with @francesco-romano and he is ok with what we discussed in https://github.com/robotology/idyntree/issues/98#issuecomment-158823148 . I think I will first implement this and the removing the initialization in VectorFixSize
and MatrixFixSize
, because they are clearly performance bugs, while the caching of transforms in joint classes is a bit more complicated and controversial (it basically breaks thread safety of the Joint classes .
I like this:
... this process was relatively in open loop... and now it is time to close the loop.
First attempt to remove setters to zero in the default constructor :
63% tests passed, 24 tests failed out of 65
Just broke the 40 % of the tests, not bad! :dancers: (Including also the valgrind tests that detect any use of uninitialized memory).
100% tests passed, 0 tests failed out of 66
:-)
First round of improvements was done by removing initialization of memory in constructors : https://github.com/robotology/idyntree/commit/2d091137e90ab7a9919a5edc0f3781f825c75146
Tests using Valgrind was added (and then fixed) to make sure we did not miss any use of not initialized objects:
Everything is commited for now in the https://github.com/robotology/idyntree/compare/perfImprovements branch .
Tomorrow I will remove the virtual hierarchy .
We are getting closer:
KDL-based implementation :
RNEA average time 34.052 microseconds
CRBA average time 55.274 microseconds
iDynTree-based implementation :
RNEA average time 212.471 microseconds
CRBA average time 154.778 microseconds
iDynTree/KDL ratio :
RNEA ratio 6.2396
CRBA ratio 2.8002
I just brought the RNEA slowdown from 9 to 6 by just removing all the empty constructors from most the "core" classes.. this may seems counterintuitive (why the compiler is not inlining the empty method, is such an obvious optimization!), but it is actually quite logic: as the core classes and the dynamics algorithms are in two different libraries, if a empty destructor (not inline) is defined in a core
class, the compiler has no way of knowing it is empty until the linking, and CMake does not enables the LTO (Link Time Optimizations) even in Release
mode!
I started to optimize more in depth, reaching the semantics classes, for which I would ask to @nunoguedelha to briefly review my modifications.
The first problem was that semantics operation are optimized out in Release, but their initialization is not optimized out, this was solved in https://github.com/robotology/idyntree/commit/af5476d55a1175bd9484d13c7060cc2864e1c0a0 (expect for the TransformSemantics and SpatialVectorSemantics that store references and so need to be initialized, but this is not the bottleneck at the moment).
The next problem is much more tricky, and I need to write down what I am thinking because I lost track of the rationale of the changes that I wanted to make several times... nothing, lost again.
With the last commit, we reached :
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/i
KDL-based implementation :
RNEA average time 34.352 microseconds
CRBA average time 55.281 microseconds
iDynTree-based implementation :
RNEA average time 47.93 microseconds
CRBA average time 52.908 microseconds
iDynTree/KDL ratio :
RNEA ratio 1.39526
CRBA ratio 0.957074
As we are still missing a fundamental optimization, to consider the "fake links" present in the URDF as frames, and thus don't run the dynamics algoritms on them (see https://github.com/robotology/codyco-modules/issues/39) and the icub2BB5Sea.urdf
is full of frames (MTB, EMS, etc etc). I think that even if we did not match perfectly the old implementation the first round of optimization can be considered finished.
The biggest issue in the end was that the class hierarchy is too deep, and apparently even if all constructors are empty, the compiler is not able to inline all them even by compiling all the library in the same translation unit.
In this image you can se how many functions (not inlined!!) are called for a single Wrench::operator+
. As most of them just call some other function, I was expecting the compiler to inline them, but apparently this is not the case, at least in my version of GCC. The biggest improvement then was to provide some efficient implementation for this heavily used function when the semantics is not used, bypassing (almost) completely the class hierarchy. There are also some problems for constructors that need to be called even if the semantics are disabled: TransformSemantics
is one of them.
For the future, I can think of some works that could improve the situation and reduce the number of workaround:
Some changes of the perfImprovements
branch (memory in fixed vector now is not initialized) spilled in the master branch. Not a big problem but we should do an announcement of this because it could break user code.
@traversaro for some reason, I got a mail from github with a truncated version of your comment listing the workarounds. Anyway, I had no trace of the question on CRTP hierarchy, sorry for the delay.
Can the CRTP hierarchy be simplified ? Currently it has depth 4 if I am not mistaken.
Are you referring to the CRTP hierarchy in the core classes GeomVector3
, ForceVector3
, AngularForceVector3
and so on? Well actually, the depth is 3. I wouldn't say it's that complicated. I would give 2 reasons for keeping that hierarchy:
GeomVector3
which holds most of the common code.MotionVector3
allows to communalise the code for the cross product.
What we could do is to remove the constructor definitions in ForceVector3
and MotionVector3
since we don't need to instantiate these classes as long as we define AngularForceVector3
, LinearForceVector3
,... constructors based directly on GeomVector3
constructors, which would reduce indirections and improve performance.Yes, I did not meant to simplify the CRTP hierarchy in the code, more in the "amount of empty constructors/destructors" that are called for each object that is created/destroyed .
This can be done in several ways, while keeping the hierarchy as it is.
fully agree
Most important changes happening in the perfImprovements
branch now documented in https://github.com/robotology/idyntree/blob/perfImprovements/MIGRATING.md .
Over the last months we implemented the new data structures to implement all our algorithm without depending anymore on KDL. From the performance point of view this process was relatively in open loop... and now it is time to close the loop. : ) I committed a basic benchmark in https://github.com/robotology/idyntree/commit/520df5b2ef9696f541819c0a192c3080186db309 , the results on my pc with a random load are :
So in average the pure iDynTree implementation are now one order of magnitude slower then the previous implementation. I actually have on which is the bottleneck that is causing this slowdown (in a nutshell : the old joint system was caching the joint transforms, while the new systems is computing them everytime). However I prefer to be methodologically and see the results of running this benchmark on some profiling tools and see if that is actually the bottleneck, and then solve it.