laas / metapod

A template-based robot dynamics library
GNU Lesser General Public License v3.0

simplify the tree structure #24

Closed sbarthelemy closed 11 years ago

sbarthelemy commented 11 years ago

There are a few things that annoy me in the current metapod structure:

  1. model and storage mixup: we could separate the kinematic model (tree of joints), the state-independent parameters (masses and lengths), and the data which depend on the state.
  2. joint macros: I see no point in duplicating macros in each joint. Each joint type could be a static class (once the separation mentioned above is effective).
  3. tree structure:
    • the children can be accessed from the Node tree, yet the parent is accessed from Body::Parent.
    • The joint is accessed from both Node::Joint and Node::Body::Joint.

Proposed solution

  1. Split data:
    • Put all the state-dependent data in a storage array (members to be defined). As a bonus, I think this would make the algorithms more similar to Featherstone's book.
    • Let the constant parameters be initialized at run time. (I do not expect this would impact performance. Am I overlooking something?)
  2. Put each joint type in a static class (say, FloatingJoint, RevoluteJoint...)
  3. I like the Node tree, but have not yet succeeded in adding a "parent" typedef to the tree. So I plan to drop it and use a flat set of classes which would merge the data from the current Node and Body classes. These classes would not be templated, but macro-generated. For instance, for a 6+3 dof robot, we would get:

    class NodeTorso {
      typedef NoParent parent;
      typedef FloatingJoint joint;
      typedef NodeLShoulder child0;
      typedef NodeHead child1;
      static const int idx = 0;
    };

    class NodeLShoulder {
      typedef NodeTorso parent;
      typedef RevoluteJoint joint;
      typedef NodeLArm child0;
      typedef NoChild child1;
      static const int idx = 1;
    };

    class NodeLArm {
      typedef NodeLShoulder parent;
      typedef RevoluteJoint joint;
      typedef NoChild child0;
      typedef NoChild child1;
      static const int idx = 2;
    };

    class NodeHead {
      typedef NodeTorso parent;
      typedef RevoluteJoint joint;
      typedef NoChild child0;
      typedef NoChild child1;
      static const int idx = 3;
    };

    I'll also probably add a "NodeRoot" on top of that hierarchy to fix issue #14. Alternatively, one could generate a single class templated by the index, but I do not see what we would gain or lose.

Comments ?

olivier-stasse commented 11 years ago

Hi Sébastien, Thanks for all those interesting comments.

Regarding point 1,

Regarding point 2, again I am afraid it is going against our current approach, which aims at doing symbolic computation at compile time. In this spirit, classes are instances of templates, which hold the static data. If you think you have a good technical alternative I will be glad to discuss it. We tried GiNaC, and got the same result as what Metapod currently does.

Regarding point 3, introducing a NodeRoot is a good idea. The tree is written to perform a recursive substitution which is equivalent to symbolic computation. Of course, if we apply subpoint 1.2 and point 2 then point 3 makes perfect sense, but the primary goal of Metapod was for us to replace Maple by meta-programming techniques.

olivier-stasse commented 11 years ago

Hi Sébastien, by mistake I removed your comments; this is a copy-paste of what was sent by email, sorry for that.


Hello Olivier,

thank you for your comments.

Regarding point 1, For subpoint 1, I completely agree, the loss in efficiency should be marginal.

ok

For subpoint 2, I think it is going against the concept of metapod, where the compiler does as much pre-computation as possible when the data are available. I am afraid that we will lose all the benefit of the current structure. But if you have an implementation much to your taste and with the same efficiency I will gladly switch.

I agree that there is potential efficiency gain when more is known at compile time, but I doubt metapod takes any benefit from this knowledge currently. What do you think?

On the other hand, initializing masses and lengths at runtime would allow me to account for slight differences between robot versions with a single metapod model.

It would be perfectly acceptable for me to keep most of the model fixed at compile time except for a few parameters (mass of the head...), which would be fixed at run time (init). But this is more work, for (I think) no gain at the moment. I might try and we'll see.

Regarding point 2, again I am afraid it is going against our current approach, which aims at doing symbolic computation at compile time. In this spirit, classes are instances of templates, which hold the static data. If you think you have a good technical alternative I will be glad to discuss it.

I do not think the Xt transform belongs to the joint (both conceptually and in Featherstone's book); we could move it to runtime (cf. 1.1) or to the Node class. There is not much left then. The indexes can be kept in the Node class too. The only problematic data I can think of is the axis of JOINT_REVOLUTE_ANY. I have yet to find a solution for this one.

We tried GiNaC, and got the same result as what Metapod currently does.

I have not checked the ginac branch. But a quick benchmark here showed that the inertia matrix computation from metapod is 3 times slower than the one from HuMAnS, where masses and lengths are passed in at runtime. This was with a NAO model, on an Atom CPU, with single-precision computations.

Regarding point 3, introducing a noderoot is a good idea.

ok

The tree is written to perform a recursive substitution which equivalent to symbolic computation. Of course, if we >apply subpoint 1.2 and point 2 then point 3 makes perfect sense,

I think the classes and typedefs I propose here would still form a tree which can be visited recursively at compile time. It just puts parent and children together, which makes sense IMHO. Apparently some advanced template trickery would let us reach parents while keeping a real tree of types, but I could not wrap my mind around it yet (and I'm not sure it is worth it).

but the primary goal of Metapod was for us to replace Maple by meta-programming techniques.

Do you have reference material on the techniques enabling that?

My understanding is that, in its current state, metapod has good performance because it uses efficient algorithms which avoid recalculations, unrolls loops and uses Eigen at its best with fixed-size blocks.

But I saw no hint of anything symbolic nor CAS-like. Then again, I'm neither CAS- nor symbolic-literate. Could you clarify that point?

olivier-stasse commented 11 years ago

I have not checked the ginac branch. But a quick benchmark here showed that the inertia matrix computation from metapod is 3 times slower than the one from HuMAnS, where masses and lengths are passed in at runtime. This was with a NAO model, on an Atom CPU, with single-precision computations.

That is VERY interesting, because the whole origin of metapod came from our use of HuMAnS to compute the RNEA algorithm. With Pierre-Brice we were convinced that we could do the same without Maple. In HuMAnS, most of the computation for the RNEA-like quantities is realized by expanding the computation tree recursively. If you use very aggressive compilation options (and we did for gcc), most of the static computations are precompiled. For this specific part, we had a better improvement ratio with Metapod than with the code generated by HuMAnS.

Thanks for pointing out the problem regarding the inertia matrix. I did not follow the CRBA algorithm as closely as the RNEA. I am VERY interested in any technique that could improve our computation time.

sbarthelemy commented 11 years ago

My first benchmark was with the following flags

-pipe -fomit-frame-pointer -fno-align-jumps -fno-align-functions
-fno-align-labels -fno-align-loops -m32 -mtune=generic -DOE_CROSS_BUILD
-O3 -DNDEBUG -fPIC

I did it again with these ones

-pipe -fomit-frame-pointer -fno-align-jumps -fno-align-functions
-fno-align-labels -fno-align-loops -m32 -mtune=atom -mssse3 -mfpmath=sse
-DOE_CROSS_BUILD -O3 -DNDEBUG -fPIC

it's getting better. CRBA from metapod is now "only" 2.2 times slower than the one from HuMAnS, which took no benefit from the new flags.

All of this is with gcc 4.5.3

sbarthelemy commented 11 years ago

I built a small prototype. It is there

https://gist.github.com/4130724

Separating the data which depends on the state from that which does not was not as simple as I (naively) expected. For instance, the joint S matrix differs between joints both in its type and in its constness.

This has two consequences:

  • I ended up with one class per type of joint.

  • I also added a getter (S()), but we may remove it.

In the tree of nodes there is a typedef which tells which joint class to use for each node. This tree is a pure type, not an object (just like it is in metapod currently).

Yet, in addition, we need a container of joint instances. For this I used a boost::fusion::vector, indexed by the node index. It contains one object per joint, of the proper type. (From what I can tell, this vector provides the functionality of a tuple, and is implemented as a struct having one data member per element.)

The algorithms would operate on the static tree like before, but take the joints vectors as an additional argument.

I think that, doing so, we do not prevent any compile-time optimization: the compiler still has all the pieces of information it used to. (But I cannot tell for sure.)

On the plus side, we have a single class per joint type, and the Robot is not a big singleton anymore.

I had no great idea regarding the REVOLUTE_AXIS_ANY joint. Either we macro-generate one class per distinct axis, or we pass the axis as a non-const non-static data member, at the cost of some efficiency.

What do you think?

olivier-stasse commented 11 years ago

Thanks Sébastien, I will have a look. As a first remark, I think that the main reason for our current lack of efficiency is the way we handle S. I am currently building a prototype myself, but with a different approach, where basically the joints will be templated.

I do not have a workable version yet.

sbarthelemy commented 11 years ago

Hi,

I pushed branch fusion0 for review.

It is still not fully satisfying, but it removes all the state-dependent static data and passes the tests with no performance regression. See the commit logs for details.

All algorithms are implemented except for the jacobians, I'd like to get some feedback on what is already there before porting them too.

This branch also fixes issue #14.

The first three commits fix unrelated problems and could be merged right away.

aelkhour commented 11 years ago

I tested the branch fusion0 and, while I did not notice any performance regression for simple_humanoid, I did for the other models sample*. For instance:

before:

Model NBDOF : 31
  average execution time :
jcalc: 3.38516µs
bcalc: 5.18253µs
rnea: 14.2105µs
rnea (without jcalc): 10.6574µs
crba: 13.9336µs
crba (without jcalc): 10.0794µs
jac_point_robot (without bcalc): 20.3049µs

after:

Model NBDOF : 31
  average execution time :
jcalc: 3.41115µs
bcalc: 4.90274µs
rnea: 17.2601µs
rnea (without jcalc): 13.6959µs
crba: 13.504µs
crba (without jcalc): 9.74358µs

I do not have an explanation for this. I tried to run a dedicated benchmark for the rnea without jcalc and do some profiling, but in this case the computation times became almost the same...

Other comments:

  • boost::fusion limits us to 50 joints (there are ways to circumvent this limitation or boost::fusion as

There seems to be a part missing at the end of this message.

  • the getTorques function (from tools/print.hh) was removed (is it useful?)

It is if we want to retrieve a vector containing all computed torques with rnea. Is there currently any other alternative?

  • the dotS member of the joints was removed since it was unused. Is this expected?

dotS is used for example in computing the acceleration term cj (see include/tools/joint.hh) in the rnea. It is equal to the apparent time derivative of S; it is therefore equal to zero as long as the S matrix is constant. This is sadly not the case for our current free-flyer joint (Featherstone proposes another formulation where the S matrix of the free-flyer joint is constant), which means that the current free-flyer jcalc method is buggy, and dotS must be computed correctly. I can take care of fixing this bug, but I need to know on which branch to push it.
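For reference, the kinematic relations behind this, in Featherstone's spatial-vector notation (a sketch of the standard formulation, not metapod's exact code):

```latex
v_J = S \dot{q}, \qquad
a_J = S \ddot{q} + \dot{S} \dot{q} = S \ddot{q} + c_J
```

so $c_J = \dot{S}\dot{q}$ vanishes exactly when $S$ is constant in the frame where the derivative is taken, which is why a coordinate choice that makes $S$ constant removes the dotS computation entirely.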

sbarthelemy commented 11 years ago

Hello Antonio, thank you for your time and feedback.

I do not have an explanation for this. I tried to run a dedicated benchmark for the rnea without jcalc and do some profiling, but in this case the computation times became almost the same...

I'll have a look.

boost::fusion limits us to 50 joints (there are ways to circumvent this limitation or boost::fusion as There seems to be a part missing at the end of this message.

sure. Alas, I cannot recall what I meant ;)

the getTorques function (from tools/print.hh) was removed (is it useful?) It is if we want to retrieve a vector containing all computed torques with rnea. Is there currently any other alternative?

not any, besides writing your own visitor to do the job. I'll add it back.

the dotS member of the joints was removed since it was unused. Is this expected?

dotS is used for example in computing the acceleration term cj (see include/tools/joint.hh) in the rnea.

on the master version, I could not find where cj is initialized. No jcalc seems to use it.

It is equal to the apparent time derivative of S; it is therefore equal to zero as long as the S matrix is constant. This is sadly not the case for our current free-flyer joint (Featherstone proposes another formulation where the S matrix of the free-flyer joint is constant), which means that the current free-flyer jcalc method is bugged, and dotS must be computed correctly.

agreed.

I can take care of fixing this bug,

ok

but I need to know on which branch to push it.

I think you should push it on master. I'll port the fix on fusion0.

aelkhour commented 11 years ago

The dotS matrix and the cj acceleration term are now computed correctly for free-flyer joints. I also updated the rnea reference file for simple_humanoid, as the torque values are now different.

I did not notice any performance regression, as we only have one free-flyer joint in our simple_humanoid model. Nevertheless, note that dotS is, just like S, a 6x6 block-sparse matrix in the free-flyer joint case, so its product with dq is still not optimal.

If we wanted optimal computations, there are two solutions in my opinion:

sbarthelemy commented 11 years ago

On Wed, Jan 23, 2013 at 10:24 AM, Antonio El Khoury <notifications@github.com> wrote:

  • Use a different representation of free-flyer joint coordinates. In fact, according to Featherstone's book, expressing the translation coordinates in the successor frame (as opposed to the parent frame, as is done now) ensures that the S matrix is constant, thus making the dotS and cj computations unnecessary.

This was done this way in arboris and sounds more natural to me. I thought the current way was chosen for compatibility with jrl-dynamics. Maybe I'm wrong?

aelkhour commented 11 years ago

I thought the current way was chosen for compatibility with jrl-dynamics.

That is correct.

Ideally I think metapod should support several types of free-flyer joints where:

sbarthelemy commented 11 years ago

Hello Antonio,

I tested the branch fusion0 and, while I did not notice any performance regression for simple_humanoid, I did for the other models sample*.

I could reproduce the problem.

I do not have an explanation for this. I tried to run a dedicated benchmark for the rnea without jcalc and do some profiling, but in this case the computation times became almost the same...

I tried with rnea without jcalc too. When the benchmark is run on a single sample model (even multiple times) there is no regression. When they are all included, there is a regression, but not systematically: one time out of four, the performance is on par with laas/master.

I had similar problems with the visitors (see 38d0adb0c28f45d024e025d2a88de5b13677c52a) and thought the problem had something to do with symbols resolution and possibly the dynamic linker.

I tried to make the sample model libraries static and also to shorten the model names, to no avail.

Quite puzzling.

olivier-stasse commented 11 years ago

There is a small problem with the benchmarks. There are two loops, and the q, dq, ddq vectors are randomly initialized outside the inner loop, i.e. once every 1000 iterations. If we do not initialize them randomly at each iteration, outside the timed region, the benchmarks look less good in comparison with HuMAnS.

sbarthelemy commented 11 years ago

There is a small problem with the benchmarks. There are two loops, and the q, dq, ddq vectors are randomly initialized outside the inner loop, i.e. once every 1000 iterations. If we do not initialize them randomly at each iteration, outside the timed region, the benchmarks look less good in comparison with HuMAnS.

Can you detail how the HuMAnS benchmarks are better?

Keep in mind that we are limited by the timer resolution. With a resolution of 0.5µs [1], we should not try to measure anything shorter than 50µs.

We currently stop the timer during the input generation and restart it afterwards; I do not think we could do that at each iteration. Maxime detailed the reasoning in the comments.

[1] http://www.boost.org/doc/libs/1_50_0/libs/timer/doc/cpu_timers.html#Timer-accuracy

olivier-stasse commented 11 years ago

Good point. Despite time(7) and the Linux kernel's HIGH RESOLUTION TIMER option claiming a 1 nanosecond precision (as well as /proc/timer_list), a run of cpu_timer_info (given in your reference link) gives 486 ns on my i7 core. However, my point is much more about the fact that q, dq, ddq are not reinitialized. It happens frequently that the compiler finds out that the data do not change over a set of iterations and does not recompute everything. I'll try to come up with a test to find out whether this is relevant or not.

sbarthelemy commented 11 years ago

(as well as /proc/timer_list)

I get 1 ns reported there too

a run of cpu_timer_info (given in your reference link) gives 486 ns

I get around 150 ns on my i7-3770 CPU @ 3.40GHz

However, my point is much more about the fact that q, dq, ddq are not reinitialized. It happens frequently that the compiler finds out that the data do not change over a set of iterations and does not recompute everything. I'll try to come up with a test to find out whether this is relevant or not.

ok.

I have more hints on the performance regression.

I did the following:

after-all-statictimer exhibits the regression

$ ./benchmark-after-all-statictimer 
*************
Model NBDOF : 3
  average execution time :
rnea (without jcalc): 1.02278µs
*************
Model NBDOF : 7
  average execution time :
rnea (without jcalc): 5.92993µs
*************
Model NBDOF : 15
  average execution time :
rnea (without jcalc): 7.94144µs
*************
Model NBDOF : 31
  average execution time :
rnea (without jcalc): 7.19926µs

after-same-statictimer does not

$ ./benchmark-after-same-statictimer 
*************
Model NBDOF : 31
  average execution time :
rnea (without jcalc): 3.44593µs
*************
Model NBDOF : 31
  average execution time :
rnea (without jcalc): 3.47723µs
*************
Model NBDOF : 31
  average execution time :
rnea (without jcalc): 3.47494µs
*************
Model NBDOF : 31
  average execution time :
rnea (without jcalc): 3.46475µs

the compiler apparently optimises differently in the two cases. Specifically, the update_kinematics function, which is part of rnea, is not fully inlined in after-all-statictimer.

$ nm --demangle  benchmark-after-all-statictimer | grep update_kinematics|wc -l
42
$ nm --demangle  benchmark-after-same-statictimer|grep update_kinematics|wc -l
0

This also sheds a different light on my earlier findings: not instantiating the algorithms in the model lib reduced the number of symbols, but that probably had no impact on performance. What had an impact, I think, was that the compiler then decided which parts of the algorithms to inline when building the benchmark, instead of doing so when building the model.

I tried adding METAPOD_HOT and "inline" to update_kinematics, but it did not change anything. I also tried to add the "flatten" attribute and some models became a bit faster, others slower. Anyway it was not enough to recover the lost performance.

I'll try to remove/fix that update_kinematics function.

We should probably think a bit more on what we want to carry in the libraries and the benchmarks.

Anyway, this should probably not stop us from merging the fusion branch; in some sense, the regression is already in master. It is just optimised out in the benchmark compilation.

olivier-stasse commented 11 years ago

Thanks a lot for your findings, I'll try to reproduce them. My CPU is Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz

The following numbers are given with my CPU.

Regarding the random generators, I made a measurement which evaluates the vector generation to 1.67µs on average. If we generate the vectors randomly and measure the generation + rnea, the overall time amounts to 7.9µs. If we subtract the 1.67µs, it gives 6.23µs, while the framework implemented by Maxime gives 6.28µs. So it is correct, but we have to make sure that the compiler is not bypassing iterations.

olivier-stasse commented 11 years ago

One reason why "inline" is not necessary is that in the RELEASE build mode, cmake usually adds the -O3 flag. If you look at the gcc manual for -O3, it is specified that all methods are considered for inlining, and the compiler decides on its own whether it is worth it.

sbarthelemy commented 11 years ago

ok, I did the same changes (static libs) to the version before the fusion changes, and as expected the regression is not there yet

$ ./benchmark-before-all-statictimer 
*************
Model NBDOF : 3
  average execution time :
rnea (without jcalc): 0.304495µs
*************
Model NBDOF : 7
  average execution time :
rnea (without jcalc): 0.74045µs
*************
Model NBDOF : 15
  average execution time :
rnea (without jcalc): 1.63206µs
*************
Model NBDOF : 31
  average execution time :
rnea (without jcalc): 3.3931µs

however, the compiler apparently made the same inlining choices:

$ nm --demangle ./benchmark-before-all-statictimer | grep update_kinematics | wc -l
42

I also tried to add hot and always_inline to all functions from rnea and the depth-first traversal. I called this version after-all-statictimer-always.

nm shows that the functions were indeed inlined. The benchmark is annoying:

$ ./benchmark-after-all-statictimer-always 
*************
Model NBDOF : 3
  average execution time :
rnea (without jcalc): 1.00686µs
*************
Model NBDOF : 7
  average execution time :
rnea (without jcalc): 6.09337µs
*************
Model NBDOF : 15
  average execution time :
rnea (without jcalc): 4.86081µs
*************
Model NBDOF : 31
  average execution time :
rnea (without jcalc): 4.72034µs

The bigger model has regained most of the lost performance, but the smaller ones have not.

I'm puzzled back.

sbarthelemy commented 11 years ago

I just pushed https://github.com/sbarthelemy/metapod/tree/fusion1

This is just like fusion0 except that

before pushing, I compiled and ran the benchmarks, and the performance regression vanished. Do not ask me why...

sbarthelemy commented 11 years ago

before pushing, I compiled and ran the benchmarks, and the performance regression vanished. Do not ask me why...

ok, it turns out the regression is not there in fusion1. However, if we make the model and timer libs static and remove the simple_humanoid model (let us call this configuration fusion1-all-statictimer), the regression comes back. (I pushed the two corresponding patches.)

I ran both versions benchmarks under cachegrind.

It reports 44G instructions for qfusion1, and 59G instructions for qfusion1-all-statictimer (and the latter does fewer computations, since it does not include simple_humanoid!). Note also that there are fewer cache misses for the fusion1-all-statictimer version, but this is apparently not enough to compensate for the huge increase in the number of instructions.

Details follow (qfusion1 then qfusion1-all-statictimer)

--------------------------------------------------------------------------------
I1 cache:         32768 B, 64 B, 8-way associative
D1 cache:         32768 B, 64 B, 8-way associative
LL cache:         8388608 B, 64 B, 16-way associative
Command:          metapod/sdk/bin/benchmark
Data file:        cachegrind.out.qfusion1
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
            Ir        I1mr  ILmr             Dr       D1mr  DLmr            Dw        D1mw  DLmw 
--------------------------------------------------------------------------------
44,390,863,912 209,801,192 6,710 15,576,717,984 93,845,510 8,218 8,190,956,938 103,355,088 3,120  PROGRAM TOTALS
--------------------------------------------------------------------------------
59,317,444,214 182,111,803 3,688 17,667,898,649 45,049,989 7,246 10,082,682,048 22,026,099 2,421  PROGRAM TOTALS

Maybe we should think a bit more on what we want to carry in the model libraries and in the benchmarks. We could for instance build one "benchmark lib" per model and call them all from the executable; that would be closer to a real use case. We could also probably reduce the number of templated methods by reworking the structure a bit: some calculations can probably be abstracted out of the node class.

sbarthelemy commented 11 years ago

Never mind, it is unrelated: in the qfusion1-all-statictimer case, the benchmark is run 10 times more.

sbarthelemy commented 11 years ago

ok, I think I eventually got it.

Please do not take into consideration my [previous remarks](https://github.com/laas/metapod/issues/24#issuecomment-12697582) about dynamic vs static libraries. As you can see if you compare the tags, the two versions do not perform the same computations at all, so comparing their numbers of instructions is nonsensical.

I think the performance regression has everything to do with uninitialized data. My guess is that this data included many denormalised floating-point numbers, which involve much slower computations.

This explains why the regression is fixed in fusion1.

Yet, why does it appear again in qfusion1-all-statictimer? I think it is because in this version, the benchmark only calls the "rnea without jcalc" algorithm. Since jcalc is never called, it again operates on uninitialized data.

To confirm this theory, I first created the qfusion1-rneanojcalc tag, which calls the "rnea without jcalc" algorithm alone but keeps everything else unchanged (dynamic libs...). It does exhibit the performance regression. I then instructed the CPU to flush the denormals to zero, and regained the lost performance (see tag qfusion1-rneanojcalc-ftz).

Calling jcalc once for the whole benchmark also restores the lost performance.

So, we just had bad luck (Antonio and I) when we picked the "rnea (without jcalc)" algorithm for the purpose of benchmarking.

olivier-stasse commented 11 years ago

This really makes sense. Denormalized data can indeed be very costly. There is a CPU instruction which makes the floating-point unit set a number to zero when it is too "far away" from the other operand. Regarding the initialization at zero, there is a Linux kernel option which forces the system to randomly initialize newly allocated memory, precisely to detect this kind of case. Anyway, thanks for your investigation.

olivier-stasse commented 11 years ago

Hi all, Can we consider that the current state of master closes this issue ?

aelkhour commented 11 years ago

Fine with me.

On Tue, Jan 29, 2013 at 7:49 AM, Olivier Stasse <notifications@github.com> wrote:

Hi all, Can we consider that the current state of master closes this issue ?
