robot-acceleration / GRiDBenchmarks

Provides the benchmark experiments for the paper "GRiD: GPU Accelerated Rigid Body Dynamics with Analytical Gradients"
MIT License

About Pinocchio code generation #1

Open jcarpent opened 3 years ago

jcarpent commented 3 years ago

Dear @plancherb1,

Thanks for sharing this set of benchmarks. I wonder why you are not also comparing against Pinocchio's code generation support, which can be up to 10 times faster than standard compiled CPU code?

Best,

Justin

plancherb1 commented 3 years ago

Hi Justin,

Correct me if I did it wrong, but I did try to use only the codegen for all of my timing. If you look at the file I used for timing, https://github.com/robot-acceleration/GRiDBenchmarks/blob/main/timePinocchio.cpp, I only used the standard functions to set up variables and to do print-based testing of values; it should be all codegen for timing. I also built against the pinocchio3-preview branch, as I assumed that would be the fastest, most up-to-date version of the code. For multithreading, I put together a small threading wrapper which I found outperformed thread pools and std::thread, to make sure the CPU timing was as fast as I knew how to get it. Do let me know if there is a way to have Pinocchio run faster; I'd be more than happy to update it, and my apologies if I did something wrong with the codegen!

Best,

Brian


jcarpent commented 3 years ago

Thanks, Brian, for the quick answer. I was looking at your paper, and this was not clear from reading it. Thanks for the clarification.

Reading your code, I can see a lot of allocations when performing the matrix/matrix multiplications, which will drastically affect performance. You should use Eigen's noalias(), especially in these lines: https://github.com/robot-acceleration/GRiDBenchmarks/blob/50365b648a2d714bce15dd59bf043ec3769f5d8e/timePinocchio.cpp#L134-L135
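Concretely, the difference looks like this (a sketch with illustrative names, not your exact benchmark code):

```cpp
#include <Eigen/Dense>

// Illustrative sketch of the noalias() suggestion; names are hypothetical.
void gradient_step(const Eigen::MatrixXd &Minv,
                   const Eigen::MatrixXd &dtau_dq,
                   Eigen::MatrixXd &dqdd_dq) {
  // Without noalias(), Eigen assumes the destination may overlap an
  // operand and materializes -(Minv * dtau_dq) into a temporary before
  // copying it over, allocating on every call:
  //   dqdd_dq = -Minv * dtau_dq;
  // noalias() promises there is no overlap, so the product is written
  // straight into the pre-allocated destination:
  dqdd_dq.noalias() = -Minv * dtau_dq;
}
```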

In addition, https://github.com/robot-acceleration/GRiDBenchmarks/blob/50365b648a2d714bce15dd59bf043ec3769f5d8e/timePinocchio.cpp#L41-L44 performs some computations that could be avoided by using a dedicated pass that assumes zero acceleration (the Eigen::VectorXd::Zero() expression). I do think it would be more efficient to use a dedicated codegen implementation of the NLE (nonlinear effects) function.
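For reference, here are the two non-codegen equivalents side by side (a sketch; the codegen variant would wrap the same dedicated pass):

```cpp
#include <Eigen/Dense>
#include <pinocchio/algorithm/rnea.hpp>

// Sketch only: plain RNEA with an explicit zero acceleration versus the
// dedicated nonlinear-effects pass.
Eigen::VectorXd nle_two_ways(const pinocchio::Model &model,
                             pinocchio::Data &data,
                             const Eigen::VectorXd &q,
                             const Eigen::VectorXd &v) {
  // RNEA with a = 0 still traverses the acceleration-dependent terms:
  Eigen::VectorXd a0 = Eigen::VectorXd::Zero(model.nv);
  pinocchio::rnea(model, data, q, v, a0);

  // nonLinearEffects() computes the same C(q,v)v + g(q) term directly,
  // skipping the work that depends on the (known-zero) acceleration:
  return pinocchio::nonLinearEffects(model, data, q, v);
}
```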

I think this line is not correct: https://github.com/robot-acceleration/GRiDBenchmarks/blob/50365b648a2d714bce15dd59bf043ec3769f5d8e/timePinocchio.cpp#L213 — it assumes that there is no quaternion part in your configuration vector, which is not the case for ATLAS, for instance.
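A safe way to write that step (a sketch, not your exact code) is to go through pinocchio::integrate():

```cpp
#include <Eigen/Dense>
#include <pinocchio/algorithm/joint-configuration.hpp>

// Sketch: with a floating base, q has nq entries (including a unit
// quaternion) while the velocity has nv entries, so plain vector addition
// on q is invalid. integrate() steps along the configuration manifold and
// keeps the quaternion part normalized.
Eigen::VectorXd next_config(const pinocchio::Model &model,
                            const Eigen::VectorXd &q,
                            const Eigen::VectorXd &v,
                            double dt) {
  return pinocchio::integrate(model, q, (v * dt).eval());
}
```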

Concerning your threading wrapper: why this choice, when OpenMP could do the same?
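For comparison, the OpenMP version of such a batch loop is only a few lines (a sketch, not your actual benchmark code):

```cpp
#include <vector>
#include <omp.h>
#include <Eigen/Dense>
#include <pinocchio/algorithm/rnea.hpp>

// Sketch: one parallel for over independent dynamics calls, with one
// pinocchio::Data workspace per thread so no scratch memory is shared.
void rnea_batch_openmp(const pinocchio::Model &model,
                       std::vector<pinocchio::Data> &datas,  // one per thread
                       const std::vector<Eigen::VectorXd> &qs,
                       const std::vector<Eigen::VectorXd> &vs,
                       const std::vector<Eigen::VectorXd> &as) {
  const int N = static_cast<int>(qs.size());
  #pragma omp parallel for
  for (int k = 0; k < N; ++k) {
    pinocchio::Data &data = datas[omp_get_thread_num()];
    pinocchio::rnea(model, data, qs[k], vs[k], as[k]);
  }
}
```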

I think the size of your binaries may also affect performance; I would split the tests into several independent files. For instance, I do not know the impact of these lines: https://github.com/robot-acceleration/GRiDBenchmarks/blob/50365b648a2d714bce15dd59bf043ec3769f5d8e/timePinocchio.cpp#L198-L203 and whether they result in several loadings. It should be checked at some point.

Finally, we are preparing support for codegen on GPU in Pinocchio. It will be interesting to compare it with your setup. If you want to take part in the initiative, your contribution is more than welcome.

jcarpent commented 3 years ago

Similarly to what nvcc does with CUDA, you should also try to use the vectorization capabilities of your CPU (-march=native at least). This is not the case here. Additionally, I've realized that matrix storage order matters a lot.

You should use the same matrix ordering as the default one used in Pinocchio to speed up the computations. I do think that, with all the previous remarks applied, you should obtain better performance from the CPU pipelines.
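For reference, enabling the CPU's vector instruction set at compile time looks something like this (illustrative flags only; the actual build setup may differ):

```sh
# Illustrative only: compile with the host CPU's full vector ISA enabled.
g++ -O3 -march=native -DNDEBUG timePinocchio.cpp -o timePinocchio \
    $(pkg-config --cflags --libs pinocchio)
```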

plancherb1 commented 3 years ago

Thanks for all of the feedback; I will try to get that incorporated soon! Also, I'd love to talk to you about / help contribute to GPU Pinocchio. I'm not experienced with your codegen tools, but I would love to help however I can (and if you have some good pointers for getting up to speed on CppAD and CppADCodeGen, do send them my way). I am very curious to see how they perform on the GPU; I wonder if they can optimize across multiple GPU threads within a single computation, or just across computations.

A couple of follow-ups below to make sure I understand all of your suggestions correctly before I implement them:

Thanks again for all the feedback! I do want to be as fair as possible and honestly was wildly impressed with the speed of codegen on a single computation. It really made me work a lot harder to get wins on the GPU lol :)

jcarpent commented 3 years ago

I suggest having a private discussion after I'm back from holidays (I'm away until October 4th).

* For the `noalias()`, are you suggesting that those lines (and the similar lines for qdd) simply be replaced by `dqdd_dqs[k].noalias() = -(minv_code_gen->Minv)*(rnea_derivatives_code_gen->dtau_dq);`?

Yes, with noalias(). For the best performance, Minv should be row major as the left operand, while dtau_dq should be col major. In the current implementation, Minv is col major because it was only used for benchmarking, not as an operand.
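In Eigen terms, that would look something like this (a sketch with illustrative types and names):

```cpp
#include <Eigen/Dense>

// Sketch of the storage-order advice: for dst = A * B, a row-major left
// operand and a col-major right operand let each inner product walk both
// operands contiguously in memory.
using RowMajorMatrixXd =
    Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

void apply_minv(const RowMajorMatrixXd &Minv,    // row-major left operand
                const Eigen::MatrixXd &dtau_dq,  // col-major right operand
                Eigen::MatrixXd &dqdd_dq) {
  dqdd_dq.noalias() = -Minv * dtau_dq;
}
```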

* I didn't use OpenMP because I did some testing about a year ago in which my implementation beat the standard thread pool, so I just stuck with what I previously knew to be faster, to give your code as fair a test as possible (and, if you recall, there was a weird error building Pinocchio with OpenMP when I was trying to build). I should probably check again with the most recent version of OpenMP to see if that has changed. I believe the reason for the previous performance win is that, in the way I use my implementation, each thread continuously hits the same memory locations in the q, qd, and result arrays, which leads to better cache performance.

Sounds interesting; it would be nice to explore it a bit more to understand why your implementation can be faster than OpenMP. OpenMP works perfectly with Pinocchio (at least on Mac and Linux).

plancherb1 commented 3 years ago

Sounds great, let's definitely reconnect when you are back, and I'll try to make some of those changes in the meantime.

Does it make the most sense for you to just send me an email when you are back and settled in, and we can find a time to chat? If that works for you, I'm brian_plancher@g.harvard.edu. Alternatively, happy to calendar something now!

Enjoy your vacation!