Optimize the new iDynTree data structures/algorithms

traversaro commented 8 years ago

Over the last months we implemented the new data structures to implement all our algorithm without depending anymore on KDL. From the performance point of view this process was relatively in open loop... and now it is time to close the loop. : ) I committed a basic benchmark in https://github.com/robotology/idyntree/commit/520df5b2ef9696f541819c0a192c3080186db309 , the results on my pc with a random load are :

pegua@pareto:~/src/codyco-superbuild/build/libraries/iDynTree$ ./src/tests/benchmark/DynamicsBenchmark 
Dynamics benchmark, iDynTree built in Release mode 
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/oneLink.urdf
KDL-based implementation : 
    RNEA average time 0.290871 microseconds
    CRBA average time 0.33617 microseconds
iDynTree-based implementation : 
    RNEA average time 3.14713 microseconds
    CRBA average time 0.298023 microseconds
iDynTree/KDL ratio : 
    RNEA average time 10.8197
    CRBA average time 0.886525
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/twoLinks.urdf
KDL-based implementation : 
    RNEA average time 1.14679 microseconds
    CRBA average time 2.15292 microseconds
iDynTree-based implementation : 
    RNEA average time 20.0534 microseconds
    CRBA average time 8.07524 microseconds
iDynTree/KDL ratio : 
    RNEA average time 17.4865
    CRBA average time 3.75083
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/threeLinks.urdf
KDL-based implementation : 
    RNEA average time 1.83582 microseconds
    CRBA average time 3.4523 microseconds
iDynTree-based implementation : 
    RNEA average time 38.9028 microseconds
    CRBA average time 21.5983 microseconds
iDynTree/KDL ratio : 
    RNEA average time 21.1909
    CRBA average time 6.25622
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/twoLinksFixed.urdf
KDL-based implementation : 
    RNEA average time 0.760555 microseconds
    CRBA average time 0.972748 microseconds
iDynTree-based implementation : 
    RNEA average time 8.45671 microseconds
    CRBA average time 1.74522 microseconds
iDynTree/KDL ratio : 
    RNEA average time 11.1191
    CRBA average time 1.79412
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/icubTwoLinks.urdf
KDL-based implementation : 
    RNEA average time 0.505447 microseconds
    CRBA average time 1.02758 microseconds
iDynTree-based implementation : 
    RNEA average time 10.3426 microseconds
    CRBA average time 4.24862 microseconds
iDynTree/KDL ratio : 
    RNEA average time 20.4623
    CRBA average time 4.13457
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/icub.urdf
KDL-based implementation : 
    RNEA average time 12.6314 microseconds
    CRBA average time 36.0465 microseconds
iDynTree-based implementation : 
    RNEA average time 269.737 microseconds
    CRBA average time 375.643 microseconds
iDynTree/KDL ratio : 
    RNEA average time 21.3545
    CRBA average time 10.4211
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/icub_model.urdf
KDL-based implementation : 
    RNEA average time 15.1253 microseconds
    CRBA average time 39.7229 microseconds
iDynTree-based implementation : 
    RNEA average time 302.987 microseconds
    CRBA average time 386.999 microseconds
iDynTree/KDL ratio : 
    RNEA average time 20.0318
    CRBA average time 9.74245
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/icub_skin_frames.urdf
KDL-based implementation : 
    RNEA average time 14.8702 microseconds
    CRBA average time 37.3697 microseconds
iDynTree-based implementation : 
    RNEA average time 290.613 microseconds
    CRBA average time 369.844 microseconds
iDynTree/KDL ratio : 
    RNEA average time 19.5434
    CRBA average time 9.8969
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/bigman.urdf
KDL-based implementation : 
    RNEA average time 18.5442 microseconds
    CRBA average time 38.1804 microseconds
iDynTree-based implementation : 
    RNEA average time 342.174 microseconds
    CRBA average time 349.6 microseconds
iDynTree/KDL ratio : 
    RNEA average time 18.4518
    CRBA average time 9.15655
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/robotModelTestTwoLinks.urdf
KDL-based implementation : 
    RNEA average time 0.455379 microseconds
    CRBA average time 1.01328 microseconds
iDynTree-based implementation : 
    RNEA average time 10.2782 microseconds
    CRBA average time 4.24147 microseconds
iDynTree/KDL ratio : 
    RNEA average time 22.5707
    CRBA average time 4.18588
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/icub2BB5Sea.urdf
KDL-based implementation : 
    RNEA average time 25.4083 microseconds
    CRBA average time 42.6245 microseconds
iDynTree-based implementation : 
    RNEA average time 396.819 microseconds
    CRBA average time 285.847 microseconds
iDynTree/KDL ratio : 
    RNEA average time 15.6177
    CRBA average time 6.70618
pegua@pareto:~/src/codyco-superbuild/build/libraries/iDynTree$ ./src/tests/benchmark/DynamicsBenchmark 
Dynamics benchmark, iDynTree built in Release mode 
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/oneLink.urdf
KDL-based implementation : 
    RNEA average time 0.216961 microseconds
    CRBA average time 0.193119 microseconds
iDynTree-based implementation : 
    RNEA average time 1.87397 microseconds
    CRBA average time 0.195503 microseconds
iDynTree/KDL ratio : 
    RNEA average time 8.63736
    CRBA average time 1.01235
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/twoLinks.urdf
KDL-based implementation : 
    RNEA average time 0.61512 microseconds
    CRBA average time 1.17302 microseconds
iDynTree-based implementation : 
    RNEA average time 11.301 microseconds
    CRBA average time 4.63963 microseconds
iDynTree/KDL ratio : 
    RNEA average time 18.3721
    CRBA average time 3.95528
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/threeLinks.urdf
KDL-based implementation : 
    RNEA average time 0.734329 microseconds
    CRBA average time 1.52111 microseconds
iDynTree-based implementation : 
    RNEA average time 17.6311 microseconds
    CRBA average time 9.51767 microseconds
iDynTree/KDL ratio : 
    RNEA average time 24.0097
    CRBA average time 6.25705
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/twoLinksFixed.urdf
KDL-based implementation : 
    RNEA average time 0.445843 microseconds
    CRBA average time 0.579357 microseconds
iDynTree-based implementation : 
    RNEA average time 5.42402 microseconds
    CRBA average time 1.00136 microseconds
iDynTree/KDL ratio : 
    RNEA average time 12.1658
    CRBA average time 1.7284
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/icubTwoLinks.urdf
KDL-based implementation : 
    RNEA average time 0.572205 microseconds
    CRBA average time 1.02282 microseconds
iDynTree-based implementation : 
    RNEA average time 12.0878 microseconds
    CRBA average time 4.25577 microseconds
iDynTree/KDL ratio : 
    RNEA average time 21.125
    CRBA average time 4.16084
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/icub.urdf
KDL-based implementation : 
    RNEA average time 12.126 microseconds
    CRBA average time 35.0285 microseconds
iDynTree-based implementation : 
    RNEA average time 281.32 microseconds
    CRBA average time 367.453 microseconds
iDynTree/KDL ratio : 
    RNEA average time 23.1998
    CRBA average time 10.4901
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/icub_model.urdf
KDL-based implementation : 
    RNEA average time 14.3313 microseconds
    CRBA average time 37.5533 microseconds
iDynTree-based implementation : 
    RNEA average time 291.858 microseconds
    CRBA average time 392.578 microseconds
iDynTree/KDL ratio : 
    RNEA average time 20.365
    CRBA average time 10.4539
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/icub_skin_frames.urdf
KDL-based implementation : 
    RNEA average time 13.7496 microseconds
    CRBA average time 36.5472 microseconds
iDynTree-based implementation : 
    RNEA average time 284.963 microseconds
    CRBA average time 358.534 microseconds
iDynTree/KDL ratio : 
    RNEA average time 20.7252
    CRBA average time 9.81016
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/bigman.urdf
KDL-based implementation : 
    RNEA average time 18.0888 microseconds
    CRBA average time 37.2195 microseconds
iDynTree-based implementation : 
    RNEA average time 331.836 microseconds
    CRBA average time 346.487 microseconds
iDynTree/KDL ratio : 
    RNEA average time 18.3448
    CRBA average time 9.30927
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/robotModelTestTwoLinks.urdf
KDL-based implementation : 
    RNEA average time 0.531673 microseconds
    CRBA average time 1.02043 microseconds
iDynTree-based implementation : 
    RNEA average time 10.3855 microseconds
    CRBA average time 4.22239 microseconds
iDynTree/KDL ratio : 
    RNEA average time 19.5336
    CRBA average time 4.13785
Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/icub2BB5Sea.urdf
KDL-based implementation : 
    RNEA average time 25.2318 microseconds
    CRBA average time 41.6279 microseconds
iDynTree-based implementation : 
    RNEA average time 395.868 microseconds
    CRBA average time 280.576 microseconds
iDynTree/KDL ratio : 
    RNEA average time 15.6892
    CRBA average time 6.74009

So in average the pure iDynTree implementation are now one order of magnitude slower then the previous implementation. I actually have on which is the bottleneck that is causing this slowdown (in a nutshell : the old joint system was caching the joint transforms, while the new systems is computing them everytime). However I prefer to be methodologically and see the results of running this benchmark on some profiling tools and see if that is actually the bottleneck, and then solve it.

traversaro commented 8 years ago

A preliminary profiling with the valgrind tool callgrind shows that 20% of the instructions is used by Joint::GetTransform() (confirming the intuition discussed in the previous comment) and 10% of the instructions is used by the empty constructor of Vector3 (because its content is initialized to 0):

Note that 10% of instructions does not account to 10% of the execution time: on modern architectures most of the execution time is lost on cache misses and branch misprediction. To profile this aspects you can use the advanced functionality of cachegrind, and collect profiling data on the actual execution of the program using perf (on Linux).

iron76 commented 8 years ago

:+1: excellent test!

traversaro commented 8 years ago

Another big bottleneck is apparently all the virtual destructor of "basic" classes such as VectorFixSize , Position , Transform, SpatialVector, etc etc .

All this classes where explicitly designed such that they could be easily used as local variable to a function (and returned by value/copied) with minimal overhead. However we forgot something:

screenshot - 23112015 - 01 21 44

In this screenshot you can see that the iDynTree::AngularForceVector3::~AngularForceVector3() accounts for the incredible number of ~126 millions of instructions, while (as a comparison) the CRBA (i.e. calculation of the mass matrix) with the old implementation alone is just ~200 millions of instructions (this are the numbers for a big number of runs of the algorithms). The RNEA with the old implementation instruction count is so low that is not part of the screenshot at all (the function are sorted by the instruction count!).

The big problem is that the iDynTree::AngularForceVector3::~AngularForceVector3() destructors is not doing anything at all! The problem is that the compiler is calling the virtual destructor of all its base class. Even the iDynTree::IVector::~IVector() destructor accounts for ~50 Millions instructions! The solution to this is to transform all the destructors of interfaces that are not meant for polymorphism (so all the interfaces, except IJoint and ISensor) to be protected and not-virtual (as opposed to be public and virtual. [1] [2] We need then to provide non virtual destructors that do nothing (the default destructor should be ok) to all classes that are meant to be used efficiently as local variables in function (the one that we listed before). We should also be sure that this destructors are properly inline by the compiler, such that the destrunction of a class of those types have zero overhead.

cc @francesco-romano

[1] : http://en.cppreference.com/w/cpp/language/destructor (See "Virtual destructors" section) [2] : http://www.gotw.ca/publications/mill18.htm (see "Guideline 4).

traversaro commented 8 years ago

I discussed with @francesco-romano and he is ok with what we discussed in https://github.com/robotology/idyntree/issues/98#issuecomment-158823148 . I think I will first implement this and the removing the initialization in VectorFixSize and MatrixFixSize, because they are clearly performance bugs, while the caching of transforms in joint classes is a bit more complicated and controversial (it basically breaks thread safety of the Joint classes .

DanielePucci commented 8 years ago

I like this:

... this process was relatively in open loop... and now it is time to close the loop.

traversaro commented 8 years ago

First attempt to remove setters to zero in the default constructor :

63% tests passed, 24 tests failed out of 65

Just broke the 40 % of the tests, not bad! :dancers: (Including also the valgrind tests that detect any use of uninitialized memory).

traversaro commented 8 years ago

100% tests passed, 0 tests failed out of 66

:-)

traversaro commented 8 years ago

First round of improvements was done by removing initialization of memory in constructors : https://github.com/robotology/idyntree/commit/2d091137e90ab7a9919a5edc0f3781f825c75146

Tests using Valgrind was added (and then fixed) to make sure we did not miss any use of not initialized objects:

traversaro commented 8 years ago

Everything is commited for now in the https://github.com/robotology/idyntree/compare/perfImprovements branch .

traversaro commented 8 years ago

Tomorrow I will remove the virtual hierarchy .

traversaro commented 8 years ago

We are getting closer:

KDL-based implementation : 
    RNEA average time 34.052 microseconds
    CRBA average time 55.274 microseconds
iDynTree-based implementation : 
    RNEA average time 212.471 microseconds
    CRBA average time 154.778 microseconds
iDynTree/KDL ratio : 
    RNEA ratio 6.2396
    CRBA ratio 2.8002

I just brought the RNEA slowdown from 9 to 6 by just removing all the empty constructors from most the "core" classes.. this may seems counterintuitive (why the compiler is not inlining the empty method, is such an obvious optimization!), but it is actually quite logic: as the core classes and the dynamics algorithms are in two different libraries, if a empty destructor (not inline) is defined in a core class, the compiler has no way of knowing it is empty until the linking, and CMake does not enables the LTO (Link Time Optimizations) even in Release mode!

traversaro commented 8 years ago

I started to optimize more in depth, reaching the semantics classes, for which I would ask to @nunoguedelha to briefly review my modifications.

The first problem was that semantics operation are optimized out in Release, but their initialization is not optimized out, this was solved in https://github.com/robotology/idyntree/commit/af5476d55a1175bd9484d13c7060cc2864e1c0a0 (expect for the TransformSemantics and SpatialVectorSemantics that store references and so need to be initialized, but this is not the bottleneck at the moment).

The next problem is much more tricky, and I need to write down what I am thinking because I lost track of the rationale of the changes that I wanted to make several times... nothing, lost again.

traversaro commented 8 years ago

With the last commit, we reached :

Benchmarking dynamics algorithms for /home/pegua/src/codyco-superbuild/libraries/iDynTree/src/tests/data/i
KDL-based implementation : 
    RNEA average time 34.352 microseconds
    CRBA average time 55.281 microseconds
iDynTree-based implementation : 
    RNEA average time 47.93 microseconds
    CRBA average time 52.908 microseconds
iDynTree/KDL ratio : 
    RNEA ratio 1.39526
    CRBA ratio 0.957074

As we are still missing a fundamental optimization, to consider the "fake links" present in the URDF as frames, and thus don't run the dynamics algoritms on them (see https://github.com/robotology/codyco-modules/issues/39) and the icub2BB5Sea.urdf is full of frames (MTB, EMS, etc etc). I think that even if we did not match perfectly the old implementation the first round of optimization can be considered finished.

The biggest issue in the end was that the class hierarchy is too deep, and apparently even if all constructors are empty, the compiler is not able to inline all them even by compiling all the library in the same translation unit.

wrenchbottleneck

In this image you can se how many functions (not inlined!!) are called for a single Wrench::operator+. As most of them just call some other function, I was expecting the compiler to inline them, but apparently this is not the case, at least in my version of GCC. The biggest improvement then was to provide some efficient implementation for this heavily used function when the semantics is not used, bypassing (almost) completely the class hierarchy. There are also some problems for constructors that need to be called even if the semantics are disabled: TransformSemantics is one of them.

For the future, I can think of some works that could improve the situation and reduce the number of workaround:

Get rid of *Raw classes
Get rid of SpatialAcc/Twist and Wrench/SpatialMomentum , and just have a type for 6D vector of type "motion" and of type "force"
Move the semantics class to be dynamically allocated. While this slows down the semantics checking, it could improve the Release behavior, reducing the number of constructor executed even if they are empty.
@nunoguedelha Can the CRTP hierarchy be simplified ? Currently it has depth 4 if I am not mistaken... perhaps @nunoguedelha could comment on that.

traversaro commented 8 years ago

Some changes of the perfImprovements branch (memory in fixed vector now is not initialized) spilled in the master branch. Not a big problem but we should do an announcement of this because it could break user code.

nunoguedelha commented 8 years ago

@traversaro for some reason, I got a mail from github with a truncated version of your comment listing the workarounds. Anyway, I had no trace of the question on CRTP hierarchy, sorry for the delay.

Can the CRTP hierarchy be simplified ? Currently it has depth 4 if I am not mistaken.

Are you referring to the CRTP hierarchy in the core classes GeomVector3, ForceVector3, AngularForceVector3 and so on? Well actually, the depth is 3. I wouldn't say it's that complicated. I would give 2 reasons for keeping that hierarchy:

We should definitely keep GeomVector3 which holds most of the common code.
MotionVector3 allows to communalise the code for the cross product. What we could do is to remove the constructor definitions in ForceVector3 and MotionVector3 since we don't need to instantiate these classes as long as we define AngularForceVector3, LinearForceVector3,... constructors based directly on GeomVector3 constructors, which would reduce indirections and improve performance.

traversaro commented 8 years ago

Yes, I did not meant to simplify the CRTP hierarchy in the code, more in the "amount of empty constructors/destructors" that are called for each object that is created/destroyed .

This can be done in several ways, while keeping the hierarchy as it is.

nunoguedelha commented 8 years ago

fully agree

traversaro commented 8 years ago

Most important changes happening in the perfImprovements branch now documented in https://github.com/robotology/idyntree/blob/perfImprovements/MIGRATING.md .

traversaro commented 8 years ago

Merged in https://github.com/robotology/idyntree/issues/129 .

robotology / idyntree

Optimize the new iDynTree data structures/algorithms #98