joaoleal / CppADCodeGen

Source Code Generation for Automatic Differentiation using Operator Overloading

CppADCodeGen support for Eigen `.transpose()` operation #61

Closed · proyan closed this issue 2 years ago

proyan commented 3 years ago

Hi João,

I'm trying to understand the performance benefits of using CppADCodeGen for sparse Eigen multiplications. There is an issue I'm facing which I hope you might be able to help with.

When multiplying two matrices in Eigen, `A * B` and `A.transpose() * B` are almost equivalent in terms of performance: the `.transpose()` operation imposes no additional computational burden, since Eigen evaluates the transpose lazily as a view rather than copying the matrix.
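For illustration, a minimal standalone Eigen sketch of the two products (the matrix size and contents here are arbitrary, not the ones from my test):

```cpp
#include <Eigen/Dense>

int main() {
    const int n = 50;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);

    // .transpose() is a lazy view, so both products make a single pass
    // over the data and cost essentially the same in plain Eigen.
    Eigen::MatrixXd C1 = A * B;
    Eigen::MatrixXd C2 = A.transpose() * B;
    (void)C1; (void)C2;
}
```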

However, when I try to code-generate these two computations, I see a difference in the computation time taken by codegen(A*B) and codegen(A.transpose() * B).

I've added a small unit-test that checks this computation. https://gist.github.com/proyan/76edb14c2835c53e66e086239a7e828f

In this file, there is a code-generated function `calc` that computes `A.transpose() * B`, where `A` is a triangular matrix and `B` is a diagonal matrix.

When running this `calc` function in normal mode and in code-generated (and compiled) mode, I get the following timings:

$ ./test-cg 
calc code-gen =         16.1503 us
calc =      25.685 us

So there is some performance benefit from code generation.

However, if I remove the `.transpose()` operation from the above `calc` function, i.e., replace line 119 of the script with

  tmp.OSIMinv_tmp.noalias() = -tmp.U1inv * M.diagonal().asDiagonal();                                                                                                             

I get the following timings after compiling the new `calc`, which computes `A * B`:

$ ./test-cg 
calc code-gen =         10.1578 us
calc =      25.0839 us

As you can see, there is a difference of roughly 6 us (quite high compared to the total time taken), while in plain Eigen the two operations have the same performance.

I've checked other parts of this function: `solveInPlace` works as it should, and the other matrix multiplications work as they should. Eigen simply changes the order of operations when doing the `.transpose()` multiplication, so my guess is that some flag is not being handled here. But this is only a guess; do you understand what the issue is?

Thanks, and best, Rohan

proyan commented 3 years ago

The command to compile the file is:

clang++ $(pkg-config --cflags eigen3 cppadcg) -O3 -DNDEBUG -std=gnu++11 test-cg.cpp -o test-cg -ldl
joaoleal commented 3 years ago

Hello,

Sorry for the late reply. I haven't had much time to fully determine the differences, but for some reason the code with and without the transpose is not performing the same operations. I've reduced ND to 5 and saved the generated code.
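(For reference, a minimal sketch of saving the generated sources with CppADCodeGen, following the library's wiki examples; the trivial tape and the name "model" are placeholders, not the actual test:)

```cpp
#include <vector>
#include <cppad/cg.hpp>

int main() {
    using CGD  = CppAD::cg::CG<double>;
    using ADCG = CppAD::AD<CGD>;

    // tape a trivial placeholder model
    std::vector<ADCG> x(2);
    CppAD::Independent(x);
    std::vector<ADCG> y(1);
    y[0] = x[0] * x[1];
    CppAD::ADFun<CGD> fun(x, y);

    // generate C source for the zero-order forward pass and save it to disk
    CppAD::cg::ModelCSourceGen<double> cgen(fun, "model");
    cgen.setCreateForwardZero(true);
    CppAD::cg::ModelLibraryCSourceGen<double> libcgen(cgen);
    CppAD::cg::SaveFilesModelLibraryProcessor<double> proc(libcgen);
    proc.saveSources();
}
```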

Here is part of the code with the transpose():

   // auxiliary variables
   double v[10];

   y[0] = -1 * x[0];
   v[0] = -1 * x[5];
   v[1] = (0 - v[0]) * x[6];
   y[5] = y[0] * v[0] + v[1];
   y[6] = -1 * x[6];
   v[0] = 0 - x[11];
   v[2] = -1 * (x[5] * v[0] + x[10]);
   v[3] = (0 - v[2]) * x[12];
   y[10] = v[1] * v[0] + y[0] * v[2] + v[3];
   v[2] = (0 - v[0]) * x[12];
   y[11] = y[6] * v[0] + v[2];
   y[12] = -1 * x[12];
   v[0] = 0 - x[17];
   v[4] = 0 - x[16] - v[0] * x[11];
   v[5] = -1 * (x[10] * v[0] + x[5] * v[4] + x[15]);
   v[6] = (0 - v[5]) * x[18];
   y[15] = v[1] * v[4] + y[0] * v[5] + v[3] * v[0] + v[6];
   v[5] = (0 - v[4]) * x[18];
   y[16] = v[2] * v[0] + y[6] * v[4] + v[5];
   v[4] = (0 - v[0]) * x[18];
   y[17] = y[12] * v[0] + v[4];
   y[18] = -1 * x[18];
   v[0] = 0 - x[23];
   v[7] = 0 - x[22] - v[0] * x[17];
   v[8] = 0 - x[21] - v[0] * x[16] - v[7] * x[11];
   v[9] = -1 * (x[10] * v[7] + x[5] * v[8] + x[15] * v[0] + x[20]);
   y[20] = v[1] * v[8] + y[0] * v[9] + v[3] * v[7] + v[6] * v[0] + (0 - v[9]) * x[24];
   y[21] = v[2] * v[7] + y[6] * v[8] + v[5] * v[0] + (0 - v[8]) * x[24];
   y[22] = v[4] * v[0] + y[12] * v[7] + (0 - v[7]) * x[24];
   y[23] = y[18] * v[0] + (0 - v[0]) * x[24];
   y[24] = -1 * x[24];
   // dependent variables without operations
   y[1] = 0;
   y[2] = 0;
   y[3] = 0;
   y[4] = 0;
   y[7] = 0;
   y[8] = 0;
   y[9] = 0;
   y[13] = 0;
   y[14] = 0;
   y[19] = 0;

And here the generated code without the transpose():

   // auxiliary variables
   double v[16];

   y[0] = -1 * x[0];
   v[0] = -1 * x[5];
   y[1] = (0 - v[0]) * x[0];
   v[1] = 0 - x[11];
   v[2] = -1 * (x[5] * v[1] + x[10]);
   y[2] = (0 - v[2]) * x[0];
   v[3] = 0 - x[17];
   v[4] = 0 - x[16] - v[3] * x[11];
   v[5] = -1 * (x[10] * v[3] + x[5] * v[4] + x[15]);
   y[3] = (0 - v[5]) * x[0];
   v[6] = 0 - x[23];
   v[7] = 0 - x[22] - v[6] * x[17];
   v[8] = 0 - x[21] - v[6] * x[16] - v[7] * x[11];
   v[9] = -1 * (x[10] * v[7] + x[5] * v[8] + x[15] * v[6] + x[20]);
   y[4] = (0 - v[9]) * x[0];
   y[5] = y[0] * v[0];
   v[10] = -1 * x[6];
   y[6] = y[1] * v[0] + v[10];
   v[11] = (0 - v[1]) * x[6];
   y[7] = y[2] * v[0] + v[11];
   v[12] = (0 - v[4]) * x[6];
   y[8] = y[3] * v[0] + v[12];
   v[13] = (0 - v[8]) * x[6];
   y[9] = y[4] * v[0] + v[13];
   y[10] = y[0] * v[2];
   y[11] = y[1] * v[2] + v[10] * v[1];
   v[0] = -1 * x[12];
   y[12] = v[11] * v[1] + y[2] * v[2] + v[0];
   v[14] = (0 - v[3]) * x[12];
   y[13] = v[12] * v[1] + y[3] * v[2] + v[14];
   v[15] = (0 - v[7]) * x[12];
   y[14] = v[13] * v[1] + y[4] * v[2] + v[15];
   y[15] = y[0] * v[5];
   y[16] = y[1] * v[5] + v[10] * v[4];
   y[17] = v[11] * v[4] + y[2] * v[5] + v[0] * v[3];
   v[2] = -1 * x[18];
   y[18] = v[12] * v[4] + y[3] * v[5] + v[14] * v[3] + v[2];
   v[1] = (0 - v[6]) * x[18];
   y[19] = v[13] * v[4] + y[4] * v[5] + v[15] * v[3] + v[1];
   y[20] = y[0] * v[9];
   y[21] = y[1] * v[9] + v[10] * v[8];
   y[22] = v[11] * v[8] + y[2] * v[9] + v[0] * v[7];
   y[23] = v[12] * v[8] + y[3] * v[9] + v[14] * v[7] + v[2] * v[6];
   y[24] = v[13] * v[8] + y[4] * v[9] + v[15] * v[7] + v[1] * v[6] + -1 * x[24];

As you can see, the version with transpose() performs fewer operations, uses fewer temporary variables, and several of the output variables are simply set to zero. Considering that the error between the codegen output and pure Eigen is very small (~2.77556e-17), this is probably related either to Eigen or to the example itself. This needs further investigation.
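(A discrepancy of that order can be checked with something like the sketch below; `yc` and `ye` are hypothetical names for the codegen and pure-Eigen outputs:)

```cpp
#include <Eigen/Dense>

// yc: output of the compiled codegen model, ye: output of pure Eigen.
double max_abs_error(const Eigen::VectorXd& yc, const Eigen::VectorXd& ye) {
    // ~2.8e-17 here, i.e. ordinary floating-point rounding noise
    return (yc - ye).cwiseAbs().maxCoeff();
}
```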

bradbell commented 3 years ago

I think it would help to know whether the difference in the number of operations is already present in the CppAD function `ad_fun`. You can check this using the function calls:

ad_fun.size_op()
ad_fun.size_var()
ad_fun.size_par()

If so, this would enable you to simplify your test while tracking down what is going on.
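(A minimal sketch of printing these counts, assuming `ad_fun` is the taped `CppAD::ADFun<CppAD::cg::CG<double>>` from the test:)

```cpp
#include <iostream>
#include <cppad/cg.hpp>

using CGD = CppAD::cg::CG<double>;

// Print the tape statistics suggested above.
void print_tape_sizes(const CppAD::ADFun<CGD>& ad_fun) {
    std::cout << "# op  = " << ad_fun.size_op()  << "\n"   // operations in the tape
              << "# var = " << ad_fun.size_var() << "\n"   // variables in the tape
              << "# par = " << ad_fun.size_par() << "\n";  // parameters in the tape
}
```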

joaoleal commented 3 years ago

with transpose(): # op = 131, # var = 130, # par = 12

without transpose(): # op = 115, # var = 114, # par = 12

It is also interesting to note that only the upper-right part of the matrix `osim` is defined without transpose().

bradbell commented 3 years ago

time ratio: 16.1 / 10.1 = 1.5941
operation ratio: 131 / 115 = 1.1391
(the variable ratio is similar)

I suspect that the timing difference has to do with the order in which you are accessing the elements of the matrix. Perhaps if you switched from column-major to row-major storage, the timing results would switch?
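(A sketch of what switching to row-major storage could look like in Eigen; `RowMajorADMat` is a name made up for this sketch, and it assumes Eigen's scalar traits for the AD type are in place, as in the gist:)

```cpp
#include <Eigen/Dense>
#include <cppad/cg.hpp>
#include <cppad/example/cppad_eigen.hpp> // Eigen NumTraits for CppAD::AD scalars

using ADCG = CppAD::AD<CppAD::cg::CG<double>>;

// Row-major variant of a dynamic AD matrix; taping the same computation
// with this type would show whether element-access order explains the gap.
using RowMajorADMat =
    Eigen::Matrix<ADCG, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
```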

cmastalli commented 3 years ago

There is something that is not clear to me: is the operation ratio the same for ND=50? (I mean, can we assume that?)