Dear all,

I would like to implement a 4-stage pipeline using Halide. The first two stages and the last stage can be expressed as pure functions of the input images. However, the middle stage needs to solve an eigenvalues & eigenvectors problem for each output-pixel using the jacobi iterative method (I copied the C-code below) i.e., To update each output pixel I need to calcute the eigenvalues of a symetric 3X3 matrix. The jacobi iterative method can take between 1 to 50 iterations to converge depending on the values of the 3x3 matrices. This means that the number of the needed arithmetic opertations varies drastically between different pixels.

My intention is to explore the full fusion of the 4-stages with Halide. Is it possible to express this stage only using Halide func an Expr?

Best regards,

the C-code of the jacobi iterative method is shown below: /----------------------------------------------------------------------------/ void jacobi3(float a[3][3], float d[3], float v[3][3]) /* This Jacobi routine Accepts and Returns the matrices in C convention. It also includes the ordering of the eigensystem, embedding the eigsrt routine from Numerical Recipes. This ordering is located just before the return. In principle, this routine only works for 3x3 matrices, but it can easily extended to any value of n if:

n is included as new parameter of the routine. (in this routine, n is set up to 3)
the vectors b and z are dinamically memory-allocated just in the first lines of the routine
the vectors b and z are dinamically freed.

It computes all eigenvalues and eigenvectors of a real symmetric matrix
a[0..(n-1)][0..(n-1)]. On output, elements of a above the diagonal are
destroyed. d[0..(n-1)] returns the eigenvalues of a, in descending order.
v[0..(n-1)][0..(n-1)] is a matrix whose rows contain, on output, the
normalized eigenvectors of a. nrot returns the number of Jacobi rotations
that were required.
Note the difference with respect to the routine in Numerical Recipes:
a and v follow the C convention for matrices (row-major order), and
the eigensystem is already ordered on output.
*/

{

define ROTATE(a,i,j,k,l) g=a[i][j];h=a[k][l];a[i][j]=g-s_(h+g_tau);\

a[k][l]=h+s*(g-h*tau);

int j,iq,ip,i,k; float tresh,theta,tau,t,sm,s,h,g,c,p,b[3],z[3]; int n=3; int nrot;

/* Initialize to the identity matrix. */ for (ip=0;ip<n;ip++) { for (iq=0;iq<n;iq++) v[iq][ip]=0.0; v[ip][ip]=1.0; }

/* Initialize b and d to the diagonal of a. */ for (ip=0;ip<n;ip++) { b[ip]=d[ip]=a[ip][ip]; }

/* This vector accumulates terms tapq as in Eq.(11.1.14) in Numerical Recipes.*/ for (ip=0;ip<n;ip++) z[ip]=0.0;

/* Iterative procedure. */ nrot=0; for (i=0;i<50;i++) {

/* Sum off-diagonal elements. */
sm=0.0;
for (ip=0;ip<n-1;ip++) {
    for (iq=ip+1;iq<n;iq++)
        sm += fabs(a[iq][ip]);
    }

if (sm == 0.0) {  /* --- Successful Computation of the EigenSystem --- */
/* The normal return, which relies on quadratic convergence to 
   machine underflow*/  

    /* --- Reordering the eigenvalues/vectors. ---
   Taken from eigsrt routine from Numerical Recipes.
   It sorts the eigenvalues in d into descending order, and rearranges
   the eigenvectors v correspondingly. The method is straight insertion.*/
    for (i=0;i<n;i++) {
        p=d[k=i];
        for (j=i+1;j<n;j++)
                if (d[j] >= p) p=d[k=j];
        if (k != i) {
            d[k]=d[i];
            d[i]=p;
            for (j=0;j<n;j++) {
                p=v[i][j];
                v[i][j]=v[k][j];
                v[k][j]=p;
                }
            }
        }

    return;
}

if (i < 3)  
        tresh=0.2*sm/(n*n);             /* ... on the three first sweeps. */
else
        tresh=0.0;                      /* ... thereafter. */

for (ip=0;ip<n-1;ip++) {
    for (iq=ip+1;iq<n;iq++) {
        g=100.0*fabs(a[iq][ip]);        
    /* After four sweeps, skips the rotation if the off-diagonal
       element is small. */
        if ( i > 3 && 
         (float)(fabs(d[ip])+g) == (float)fabs(d[ip]) && 
     (float)(fabs(d[iq])+g) == (float)fabs(d[iq])) {
            a[iq][ip]=0.0;
    }
        else if (fabs(a[iq][ip]) > tresh) {
            h=d[iq]-d[ip];
            if ((float)(fabs(h)+g) == (float)fabs(h))
        t=(a[iq][ip])/h;
            else {
                theta=0.5*h/(a[iq][ip]);
                t=1.0/(fabs(theta)+sqrt(1.0+theta*theta));
                if (theta < 0.0) t = -t;
                }
            c=1.0/sqrt(1+t*t);
            s=t*c;
            tau=s/(1.0+c);
            h=t*a[iq][ip];
            z[ip] -= h;
            z[iq] += h;
            d[ip] -= h;
            d[iq] += h;
            a[iq][ip]=0.0;
        for (j=0;j<=ip-1;j++) {
            ROTATE(a,ip,j,iq,j);
            }
        for (j=ip+1;j<=iq-1;j++) {
            ROTATE(a,j,ip,iq,j);
            }
        for (j=iq+1;j<n;j++) {
            ROTATE(a,j,ip,j,iq);
            }
        for (j=0;j<n;j++) {
            ROTATE(v,ip,j,iq,j);
            }
        nrot++;
            }
        }
    }
/* Update d with the sum of tapq, and reinitialize z. */
for (ip=0;ip<n;ip++) {
    b[ip] += z[ip];
    d[ip]=b[ip];
    z[ip]=0.0;
    }
}

fprintf(stderr,"\nToo many iterations in routine jacobi: %i\n", nrot);

/* This is not the way by which the routine should exit. / / --- Reordering the eigenvalues/vectors. --- Taken from eigsrt routine from Numerical Recipes. It sorts the eigenvalues in d into descending order, and rearranges the eigenvectors v correspondingly. The method is straight insertion.*/ for (i=0;i<n;i++) { p=d[k=i]; for (j=i+1;j<n;j++) if (d[j] >= p) p=d[k=j]; if (k != i) { d[k]=d[i]; d[i]=p; for (j=0;j<n;j++) { p=v[i][j]; v[i][j]=v[k][j]; v[k][j]=p; } } }

undef ROTATE

} /---------------------------------------------------------------------------/

I don't have time to go into detail, but my guess is the best approach is to implement the Jacobi iteration as an external array function (the thing you get from define_extern) using the C code you already have. The inputs and outputs are small, but large enough to need to be stored in memory anyway, so the overhead of an external array function won't be too bad and you'll still have options re: scheduling the stages around this. It won't speed up the Jacobi iteration and it won't get you on a GPU however. (I might look at using Halide to implement a routine that does one step of the refinement on many matrices at once using vectorization and then calling it inside a loop from C++...)

Halide doesn't really have a way to iterate until convergence at present. Other folks may have some ideas, but it is a case where I really expect greater language power is needed to do this.

-Z-

For 3x3 matrices, the closed form solution is certainly faster, regardless of what language you use. As a bonus, it can be expressed in halide as a pure function :)

http://mathworld.wolfram.com/CubicFormula.html

On Mon, Feb 15, 2016, 12:09 PM Zalman Stern notifications@github.com wrote:

I don't have time to go into detail, but my guess is the best approach is to implement the Jacobi iteration as an external array function (the thing you get from define_extern) using the C code you already have. The inputs and outputs are small, but large enough to need to be stored in memory anyway, so the overhead of an external array function won't be too bad and you'll still have options re: scheduling the stages around this. It won't speed up the Jacobi iteration and it won't get you on a GPU however. (I might look at using Halide to implement a routine that does one step of the refinement on many matrices at once using vectorization and then calling it inside a loop from C++...)

Halide doesn't really have a way to iterate until convergence at present. Other folks may have some ideas, but it is a case where I really expect greater language power is needed to do this.

-Z-

— Reply to this email directly or view it on GitHub https://github.com/halide/Halide/issues/1026#issuecomment-184368388.

For numerical stability and robustness issues the jacobi iterative method is the most stable.

My objective is to evaluate fusing/collapsing all the stages of my pipeline. If I use "define_extern" and pass the data through "buffer_t" for the complex stage, I will not be able to fuse that stage with others.

You can still fuse the pipeline with define_extern. The extern stage mechanism supports bounds inference, and extern stages can be computed at other stages.

Also, while naive implementations of the cubic (or even just quadratic) formulas is numerically unstable, there are ways to rewrite them such that they are numerically stable (potentially using several variants depending on the inputs). For an example with the quadratic formula, see http://people.csail.mit.edu/bkph/articles/Quadratics.pdf. This might be more difficult to do with the cubic formula, but may be worth exploring if performance/precision are issues.

My target architecture is the GPU and my objective is to evaluate the fusion of the 4 stages described below (gauss, ST, DT, PDE). DT is an external stage. I am having troubles with mapping and running the external stage individually and fused with the rest of stages on the GPU.

I get this error when I use the scheduling described in the code: In schedule for DTCould not find split dimension: x

/** Stage 1 / gauss(x, y, z) = conv0 * input(x, y, z) + conv1 * ( input(x-1, y, z) + input(x+1, y, z) + input(x, y-1, z) + input(x, y+1, z) + input(x, y, z-1) + input(x, y, z+1));

/*\ Stage 2 / / For each (x,y,z) I need to compute 6 elements: st_tensor0, st_tensor1, st_tensor2, st_tensor3, st_tensor4, st_tensor5 */

st_tenso[0](x, y, z) = (gauss(x+1, y, z) - gauss(x-1, y, z)) * iTwo_hx * (gauss(x+1, y, z) - gauss(x-1, y, z)) * iTwo_hx; st_tensor[1](x, y, z) = (gauss(x+1, y, z) - gauss(x-1, y, z)) * iTwo_hx * (gauss(x, y+1, z) - gauss(x, y-1, z)) * iTwo_hy; st_tensor[2](x, y, z) = (gauss(x+1, y, z) - gauss(x-1, y, z)) * iTwo_hx * (gauss(x, y, z+1) - gauss(x, y, z-1)) * iTwo_hz; st_tensor[3](x, y, z) = (gauss(x, y+1, z) - gauss(x, y-1, z)) * iTwo_hy * (gauss(x, y+1, z) - gauss(x, y-1, z)) * iTwo_hy; st_tensor[4](x, y, z) = (gauss(x, y+1, z) - gauss(x, y-1, z)) * iTwo_hy * (gauss(x, y, z+1) - gauss(x, y, z-1)) * iTwo_hz; st_tensor[5](x, y, z) = (gauss(x, y, z+1) - gauss(x, y, z-1)) * iTwo_hz * (gauss(x, y, z+1) - gauss(x, y, z-1)) * iTwo_hz;

/*\ Stage 3 / / DT computes the eigen-values and vectors of each one of the symmetric matrix asociated to each pixel st_tensor(x,y,z). DT reads and writes from/in the same structure, i.e., st_tensor */

Func DT;
std::vector<ExternFuncArgument> argIn;
argIn.push_back(st_tensor[0]);
argIn.push_back(st_tensor[1]);
argIn.push_back(st_tensor[2]);
argIn.push_back(st_tensor[3]);
argIn.push_back(st_tensor[4]);
argIn.push_back(st_tensor[5]);

std::vector<Type> argOut;
argOut.push_back(Float(32));
argOut.push_back(Float(32));
argOut.push_back(Float(32));
argOut.push_back(Float(32));
argOut.push_back(Float(32));
argOut.push_back(Float(32));
DT.define_extern("Jacobi_3x3", argIn, argOut, 3);

/** Stage 4 / PDE(x,y,z)=Expressed in term of (DT(x,y,z)[0],DT(x,y,z)[1],DT(x,y,z)[2],DT(x,y,z)[3],DT(x,y,z)[4],DT(x,y,z)[5],input(x,y,z))

/* Schedule for the GPU _/ gauss.gpu_tile(x, y, z, 32, 8, 1); st_tensor [0].gpu_tile(x, y, z, 32, 8, 1); st_tensor [1].gpu_tile(x, y, z, 32, 8, 1); st_tensor [2].gpu_tile(x, y, z, 32, 8, 1); st_tensor [3].gpu_tile(x, y, z, 32, 8, 1); st_tensor [4].gpu_tile(x, y, z, 32, 8, 1); st_tensor [5].gpu_tile(x, y, z, 32, 8, 1); DT.gputile(x,y,z,32,1,1); / How to correctly schedule DT on the GPU? to be able to fuse it with the other stages _/ PDE.gputile(x, y, z, 32, 8, 1); /***/ // Construct a target that uses the GPU. Halide::Target target = Halide::get_host_target();

// Enable OpenCL as the GPU backend. target.set_feature(Halide::Target::CUDA);

// JIT-compile the pipeline. gauss.compile_jit(target); st_tensor[0].compile_jit(target); st_tensor[1].compile_jit(target); st_tensor[2].compile_jit(target); st_tensor[3].compile_jit(target); st_tensor[4].compile_jit(target); st_tensor[5].compile_jit(target); DT.compile_jit(target); PDE.compile_jit(target);

// Run it. gauss.realize(Xdim, Ydim, Zdim); st_tensor[0].realize(Xdim, Ydim, Zdim); st_tensor[1].realize(Xdim, Ydim, Zdim); st_tensor[2].realize(Xdim, Ydim, Zdim); st_tensor[3].realize(Xdim, Ydim, Zdim); st_tensor[4].realize(Xdim, Ydim, Zdim); st_tensor[5].realize(Xdim, Ydim, Zdim); DT.realize(Xdim, Ydim, Zdim);

Ah, yeah, if you are trying to run on the GPU, extern stages are not an option. Sorry if I missed that. I see Zalman mentioned this, but I didn't see any mention of it in the original message.

Another option (in addition to the closed form solution) is that you could probably write the Jacobi iteration using update definitions. However, as Zalman says, you won't be able to terminate the iteration at convergence in a reasonable way. Here are the two options I would consider if I were in your position (both previously mentioned):

Use a closed form solution, rewriting some of the expressions to avoid numerically unstable operations.
Use a more naive closed form solution as an initial guess, with a fixed number of Jacobi iterations sufficient to recover the numerical precision required. (However, I suspect that the error from the non-numerically stable closed form solution might be arbitrarily large, so it might not be reliable enough to get away with using a small fixed number of iterations.)

Please, help me to decide whether it makes sense implemeting my code in Halide. I have to submit the revised version of a manuscript as soon as possible. One of the reviewers asked me to compare my hand-written codes with a Halide version. My codes manually implement all the possible fusions of different stages with and without tiling.

The problematic stage of the pipeline needs to calculate the eigenvalues and vectores of a large number of small 3x3 symetric matrices at the same time for each pixel on the GPU. Following your recommendation, this is how an analytical method looks like (I copied the code below). I took this C-code, which computes the eigenvalues and eigenvectors for a symmetric 3x3 matrix from: https://www.mpi-hd.mpg.de/personalhomes/globes/3x3/

Do you think that it is possible to express this method using Halide language?
Do you think that the resulting Halide implementation will be efficient?
With the JIT compilation of the two first stages Halide is more than 10 times slower than my CUDA-code. Do you think that using the ahead-of-time compilation the execution of Halide version will be comparable with my implementation?

// This is the code that computes the eigenvectors:

include

// Macros

define SQR(x) ((x)*(x)) // x^2

// ---------------------------------------------------------------------------- int dsyevv3(float A[3][3], float Q[3][3], float w[3]) // ---------------------------------------------------------------------------- // Calculates the eigenvalues and normalized eigenvectors of a symmetric 3x3 // matrix A using Cardano's method for the eigenvalues and an analytical // method based on vector cross products for the eigenvectors. // Only the diagonal and upper triangular parts of A need to contain meaningful // values. However, all of A may be used as temporary storage and may hence be // destroyed. // ---------------------------------------------------------------------------- // Parameters: // A: The symmetric input matrix // Q: Storage buffer for eigenvectors // w: Storage buffer for eigenvalues // ---------------------------------------------------------------------------- // Return value: // 0: Success // -1: Error // ---------------------------------------------------------------------------- // Dependencies: // dsyevc3() // ---------------------------------------------------------------------------- {

ifndef EVALS_ONLY

float norm; // Squared norm or inverse norm of current eigenvector float n0, n1; // Norm of first and second columns of A float n0tmp, n1tmp; // "Templates" for the calculation of n0/n1 - saves a few FLOPS float thresh; // Small number used as threshold for floating point comparisons float error; // Estimated maximum roundoff error in some steps float wmax; // The eigenvalue of maximum modulus float f, t; // Intermediate storage int i, j; // Loop counters

endif

// Calculate eigenvalues dsyevc3(A, w);

ifndef EVALS_ONLY

wmax = fabs(w[0]); if ((t=fabs(w[1])) > wmax) wmax = t; if ((t=fabs(w[2])) > wmax) wmax = t; thresh = SQR(8.0 * DBL_EPSILON * wmax);

// Prepare calculation of eigenvectors n0tmp = SQR(A[0][1]) + SQR(A[0][2]); n1tmp = SQR(A[0][1]) + SQR(A[1][2]); Q[0][1] = A[0][1]_A[1][2] - A[0][2]_A[1][1]; Q[1][1] = A[0][2]_A[1][0] - A[1][2]_A[0][0]; Q[2][1] = SQR(A[0][1]);

// Calculate first eigenvector by the formula // v[0] = (A - w[0]).e1 x (A - w[0]).e2 A[0][0] -= w[0]; A[1][1] -= w[0]; Q[0][0] = Q[0][1] + A[0][2]_w[0]; Q[1][0] = Q[1][1] + A[1][2]_w[0]; Q[2][0] = A[0][0]A[1][1] - Q[2][1]; norm = SQR(Q[0][0]) + SQR(Q[1][0]) + SQR(Q[2][0]); n0 = n0tmp + SQR(A[0][0]); n1 = n1tmp + SQR(A[1][1]); error = n0 \ n1;

if (n0 <= thresh) // If the first column is zero, then (1,0,0) is an eigenvector { Q[0][0] = 1.0; Q[1][0] = 0.0; Q[2][0] = 0.0; } else if (n1 <= thresh) // If the second column is zero, then (0,1,0) is an eigenvector { Q[0][0] = 0.0; Q[1][0] = 1.0; Q[2][0] = 0.0; } else if (norm < SQR(64.0 * DBL_EPSILON) * error) { // If angle between A[0] and A[1] is too small, don't use t = SQR(A[0][1]); // cross product, but calculate v ~ (1, -A0/A1, 0) f = -A[0][0] / A[0][1]; if (SQR(A[1][1]) > t) { t = SQR(A[1][1]); f = -A[0][1] / A[1][1]; } if (SQR(A[1][2]) > t) f = -A[0][2] / A[1][2]; norm = 1.0/sqrt(1 + SQR(f)); Q[0][0] = norm; Q[1][0] = f * norm; Q[2][0] = 0.0; } else // This is the standard branch { norm = sqrt(1.0 / norm); for (j=0; j < 3; j++) Q[j][0] = Q[j][0] * norm; }

// Prepare calculation of second eigenvector t = w[0] - w[1]; if (fabs(t) > 8.0 * DBL_EPSILON * wmax) { // For non-degenerate eigenvalue, calculate second eigenvector by the formula // v[1] = (A - w[1]).e1 x (A - w[1]).e2 A[0][0] += t; A[1][1] += t; Q[0][1] = Q[0][1] + A[0][2]_w[1]; Q[1][1] = Q[1][1] + A[1][2]_w[1]; Q[2][1] = A[0][0]A[1][1] - Q[2][1]; norm = SQR(Q[0][1]) + SQR(Q[1][1]) + SQR(Q[2][1]); n0 = n0tmp + SQR(A[0][0]); n1 = n1tmp + SQR(A[1][1]); error = n0 \ n1;

if (n0 <= thresh)       // If the first column is zero, then (1,0,0) is an eigenvector
{
  Q[0][1] = 1.0;
  Q[1][1] = 0.0;
  Q[2][1] = 0.0;
}
else if (n1 <= thresh)  // If the second column is zero, then (0,1,0) is an eigenvector
{
  Q[0][1] = 0.0;
  Q[1][1] = 1.0;
  Q[2][1] = 0.0;
}
else if (norm < SQR(64.0 * DBL_EPSILON) * error)
{                       // If angle between A[0] and A[1] is too small, don't use
  t = SQR(A[0][1]);     // cross product, but calculate v ~ (1, -A0/A1, 0)
  f = -A[0][0] / A[0][1];
  if (SQR(A[1][1]) > t)
  {
    t = SQR(A[1][1]);
    f = -A[0][1] / A[1][1];
  }
  if (SQR(A[1][2]) > t)
    f = -A[0][2] / A[1][2];
  norm    = 1.0/sqrt(1 + SQR(f));
  Q[0][1] = norm;
  Q[1][1] = f * norm;
  Q[2][1] = 0.0;
}
else
{
  norm = sqrt(1.0 / norm);
  for (j=0; j < 3; j++)
    Q[j][1] = Q[j][1] * norm;
}

} else { // For degenerate eigenvalue, calculate second eigenvector according to // v[1] = v[0] x (A - w[1]).e[i] //
// This would really get to complicated if we could not assume all of A to // contain meaningful values. A[1][0] = A[0][1]; A[2][0] = A[0][2]; A[2][1] = A[1][2]; A[0][0] += w[0]; A[1][1] += w[0]; for (i=0; i < 3; i++) { A[i][i] -= w[1]; n0 = SQR(A[0][i]) + SQR(A[1][i]) + SQR(A[2][i]); if (n0 > thresh) { Q[0][1] = Q[1][0]_A[2][i] - Q[2][0]_A[1][i]; Q[1][1] = Q[2][0]_A[0][i] - Q[0][0]_A[2][i]; Q[2][1] = Q[0][0]_A[1][i] - Q[1][0]_A[0][i]; norm = SQR(Q[0][1]) + SQR(Q[1][1]) + SQR(Q[2][1]); if (norm > SQR(256.0 * DBL_EPSILON) * n0) // Accept cross product only if the angle between { // the two vectors was not too small norm = sqrt(1.0 / norm); for (j=0; j < 3; j++) Q[j][1] = Q[j][1] * norm; break; } } }

if (i == 3)    // This means that any vector orthogonal to v[0] is an EV.
{
  for (j=0; j < 3; j++)
    if (Q[j][0] != 0.0)                                   // Find nonzero element of v[0] ...
    {                                                     // ... and swap it with the next one
      norm          = 1.0 / sqrt(SQR(Q[j][0]) + SQR(Q[(j+1)%3][0]));
      Q[j][1]       = Q[(j+1)%3][0] * norm;
      Q[(j+1)%3][1] = -Q[j][0] * norm;
      Q[(j+2)%3][1] = 0.0;
      break;
    }
}

}

// Calculate third eigenvector according to // v[2] = v[0] x v[1] Q[0][2] = Q[1][0]_Q[2][1] - Q[2][0]_Q[1][1]; Q[1][2] = Q[2][0]_Q[0][1] - Q[0][0]_Q[2][1]; Q[2][2] = Q[0][0]_Q[1][1] - Q[1][0]_Q[0][1];

endif

return 0; }

//This is the code that calculate the eigenvalues // ---------------------------------------------------------------------------- // Numerical diagonalization of 3x3 matrcies // Copyright (C) 2006 Joachim Kopp // ---------------------------------------------------------------------------- // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA // ----------------------------------------------------------------------------

include

//#include "dsyevc3.h"

// Constants

define M_SQRT3 1.73205080756887729352744634151 // sqrt(3)

// Macros

define SQR(x) ((x)*(x)) // x^2

// ---------------------------------------------------------------------------- int dsyevc3(float A[3][3], float w[3]) // ---------------------------------------------------------------------------- // Calculates the eigenvalues of a symmetric 3x3 matrix A using Cardano's // analytical algorithm. // Only the diagonal and upper triangular parts of A are accessed. The access // is read-only. // ---------------------------------------------------------------------------- // Parameters: // A: The symmetric input matrix // w: Storage buffer for eigenvalues // ---------------------------------------------------------------------------- // Return value: // 0: Success // -1: Error // ---------------------------------------------------------------------------- { float m, c1, c0;

// Determine coefficients of characteristic poynomial. We write // | a d f | // A = | d* b e | // | f* e* c | float de = A[0][1] * A[1][2]; // d * e float dd = SQR(A[0][1]); // d^2 float ee = SQR(A[1][2]); // e^2 float ff = SQR(A[0][2]); // f^2 m = A[0][0] + A[1][1] + A[2][2]; c1 = (A[0][0]_A[1][1] + A[0][0]_A[2][2] + A[1][1]_A[2][2]) // a_b + a_c + b_c - d^2 - e^2 - f^2

(dd + ee + ff); c0 = A[2][2]_dd + A[0][0]_ee + A[1][1]_ff - A[0][0]_A[1][1]_A[2][2]
- 2.0 * A[0][2]_de; // c_d^2 + a_e^2 + b_f^2 - a_b_c - 2_f_d_e)
float p, sqrt_p, q, c, s, phi; p = SQR(m) - 3.0c1; q = m(p - (3.0/2.0)_c1) - (27.0/2.0)_c0; sqrt_p = sqrt(fabs(p));

phi = 27.0 * ( 0.25SQR(c1)(p - c1) + c0_(q + 27.0/4.0_c0)); phi = (1.0/3.0) * atan2(sqrt(fabs(phi)), q);

c = sqrt_p_cos(phi); s = (1.0/M_SQRT3)_sqrt_p*sin(phi);

w[1] = (1.0/3.0)*(m - c); w[2] = w[1] + s; w[0] = w[1] + c; w[1] -= s;

return 0; }

I think the reviewer is being unreasonable to ask you to learn a new programming language to do a comparison. The point of Halide is that it makes it easier to try lots of different fusion/parallelization strategies quickly, and to make it easier to vectorize on a CPU or port to the GPU. If you've already done all that work by hand, then why use Halide?

That said, I think that this sort of code is ideal for Halide. If I were writing it from scratch I would definitely use Halide and would not expect to be able to beat the performance with hand-written code. It might take a little experience with the language for this to be true, which is why I think your reviewer is unreasonable.

If you want to know the performance of AOT compiled Halide without changing your build setup, just don't include jit compilation in your timings. You probably also don't want to include copying to/from the GPU in your timings. Tutorial 12 demonstrates how to do this.

On Thu, Feb 25, 2016 at 9:56 AM, DoDNet notifications@github.com wrote:

Please, help me to decide whether it makes sense implemeting my code in Halide. I have to submit the revised version of a manuscript as soon as possible. One of the reviewers asked me to compare my hand-written codes with a Halide version. My codes manually implement all the possible fusions of different stages with and without tiling.

The problematic stage of the pipeline needs to calculate the eigenvalues and vectores of a large number of small 3x3 symetric matrices at the same time for each pixel on the GPU. This is how an analytical method looks like (I copied the code below). I took this C-code, which computes the eigenvalues and eigenvectors for a symmetric 3x3 matrix from: https://www.mpi-hd.mpg.de/personalhomes/globes/3x3/

Do you think that it is possible to express this method using Halide language?

Do you think that the resulting Halide implementation will be efficient?

With the JIT compilation of the two first stages Halide is more than 10 times slower than my CUDA-code. Do you think that using the ahead-of-time compilation the Halide version will be comparable with my implementation?

// This is the code that computes the eigenvectors:

include

// Macros

define SQR(x) ((x)*(x)) // x^2

//

int dsyevv3(float A[3][3], float Q[3][3], float w[3])

//

// Calculates the eigenvalues and normalized eigenvectors of a symmetric 3x3 // matrix A using Cardano's method for the eigenvalues and an analytical // method based on vector cross products for the eigenvectors. // Only the diagonal and upper triangular parts of A need to contain meaningful // values. However, all of A may be used as temporary storage and may hence be // destroyed.

//

// Parameters: // A: The symmetric input matrix // Q: Storage buffer for eigenvectors // w: Storage buffer for eigenvalues

//

// Return value: // 0: Success // -1: Error

//

// Dependencies: // dsyevc3()

//

{

ifndef EVALS_ONLY

float norm; // Squared norm or inverse norm of current eigenvector float n0, n1; // Norm of first and second columns of A float n0tmp, n1tmp; // "Templates" for the calculation of n0/n1 - saves a few FLOPS float thresh; // Small number used as threshold for floating point comparisons float error; // Estimated maximum roundoff error in some steps float wmax; // The eigenvalue of maximum modulus float f, t; // Intermediate storage int i, j; // Loop counters

endif

// Calculate eigenvalues dsyevc3(A, w);

ifndef EVALS_ONLY

wmax = fabs(w[0]); if ((t=fabs(w[1])) > wmax) wmax = t; if ((t=fabs(w[2])) > wmax) wmax = t; thresh = SQR(8.0 * DBL_EPSILON * wmax);

// Prepare calculation of eigenvectors n0tmp = SQR(A[0][1]) + SQR(A[0][2]); n1tmp = SQR(A[0][1]) + SQR(A[1][2]); Q[0][1] = A[0][1]_A[1][2] - A[0][2]_A[1][1]; Q[1][1] = A[0][2]_A[1][0] - A[1][2]_A[0][0]; Q[2][1] = SQR(A[0][1]);

// Calculate first eigenvector by the formula // v[0] = (A - w[0]).e1 x (A - w[0]).e2 A[0][0] -= w[0]; A[1][1] -= w[0]; Q[0][0] = Q[0][1] + A[0][2] _w[0]; Q[1][0] = Q[1][1] + A[1][2]_w[0]; Q[2][0] = A[0][0]A[1][1] - Q[2][1]; norm = SQR(Q[0][0]) + SQR(Q[1][0]) + SQR(Q[2][0]); n0 = n0tmp + SQR(A[0][0]); n1 = n1tmp + SQR(A[1][1]); error = n0 \ n1;

if (n0 <= thresh) // If the first column is zero, then (1,0,0) is an eigenvector { Q[0][0] = 1.0; Q[1][0] = 0.0; Q[2][0] = 0.0; } else if (n1 <= thresh) // If the second column is zero, then (0,1,0) is an eigenvector { Q[0][0] = 0.0; Q[1][0] = 1.0; Q[2][0] = 0.0; } else if (norm < SQR(64.0 * DBL_EPSILON) * error) { // If angle between A[0] and A[1] is too small, don't use t = SQR(A[0][1]); // cross product, but calculate v ~ (1, -A0/A1, 0) f = -A[0][0] / A[0][1]; if (SQR(A[1][1]) > t) { t = SQR(A[1][1]); f = -A[0][1] / A[1][1]; } if (SQR(A[1][2]) > t) f = -A[0][2] / A[1][2]; norm = 1.0/sqrt(1 + SQR(f)); Q[0][0] = norm; Q[1][0] = f * norm; Q[2][0] = 0.0; } else // This is the standard branch { norm = sqrt(1.0 / norm); for (j=0; j < 3; j++) Q[j][0] = Q[j][0] * norm; }

// Prepare calculation of second eigenvector t = w[0] - w[1]; if (fabs(t) > 8.0 * DBL_EPSILON * wmax) { // For non-degenerate eigenvalue, calculate second eigenvector by the formula // v[1] = (A - w[1]).e1 x (A - w[1]).e2 A[0][0] += t; A[1][1] += t; Q[0][1] = Q[0][1] + A[0][2] _w[1]; Q[1][1] = Q[1][1] + A[1][2]_w[1]; Q[2][1] = A[0][0]A[1][1] - Q[2][1]; norm = SQR(Q[0][1]) + SQR(Q[1][1]) + SQR(Q[2][1]); n0 = n0tmp + SQR(A[0][0]); n1 = n1tmp + SQR(A[1][1]); error = n0 \ n1;

if (n0 <= thresh) // If the first column is zero, then (1,0,0) is an eigenvector { Q[0][1] = 1.0; Q[1][1] = 0.0; Q[2][1] = 0.0; } else if (n1 <= thresh) // If the second column is zero, then (0,1,0) is an eigenvector { Q[0][1] = 0.0; Q[1][1] = 1.0; Q[2][1] = 0.0; } else if (norm < SQR(64.0 * DBL_EPSILON) * error) { // If angle between A[0] and A[1] is too small, don't use t = SQR(A[0][1]); // cross product, but calculate v ~ (1, -A0/A1, 0) f = -A[0][0] / A[0][1]; if (SQR(A[1][1]) > t) { t = SQR(A[1][1]); f = -A[0][1] / A[1][1]; } if (SQR(A[1][2]) > t) f = -A[0][2] / A[1][2]; norm = 1.0/sqrt(1 + SQR(f)); Q[0][1] = norm; Q[1][1] = f * norm; Q[2][1] = 0.0; } else { norm = sqrt(1.0 / norm); for (j=0; j < 3; j++) Q[j][1] = Q[j][1] * norm; }

} else { // For degenerate eigenvalue, calculate second eigenvector according to // v[1] = v[0] x (A - w[1]).e[i] //

// This would really get to complicated if we could not assume all of A to // contain meaningful values. A[1][0] = A[0][1]; A[2][0] = A[0][2]; A[2][1] = A[1][2]; A[0][0] += w[0]; A[1][1] += w[0]; for (i=0; i < 3; i++) { A[i][i] -= w[1]; n0 = SQR(A[0][i]) + SQR(A[1][i]) + SQR(A[2][i]); if (n0 > thresh) { Q[0][1] = Q[1][0]_A[2][i] - Q[2][0]_A[1][i]; Q[1][1] = Q[2][0]_A[0][i] - Q[0][0]_A[2][i]; Q[2][1] = Q[0][0]_A[1][i] - Q[1][0]_A[0][i]; norm = SQR(Q[0][1]) + SQR(Q[1][1]) + SQR(Q[2][1]); if (norm > SQR(256.0 * DBL_EPSILON) * n0) // Accept cross product only if the angle between { // the two vectors was not too small norm = sqrt(1.0 / norm); for (j=0; j < 3; j++) Q[j][1] = Q[j][1] * norm; break; } } }

if (i == 3) // This means that any vector orthogonal to v[0] is an EV. { for (j=0; j < 3; j++) if (Q[j][0] != 0.0) // Find nonzero element of v[0] ... { // ... and swap it with the next one norm = 1.0 / sqrt(SQR(Q[j][0]) + SQR(Q[(j+1)%3][0])); Q[j][1] = Q[(j+1)%3][0] * norm; Q[(j+1)%3][1] = -Q[j][0] * norm; Q[(j+2)%3][1] = 0.0; break; } }

}

// Calculate third eigenvector according to // v[2] = v[0] x v[1] Q[0][2] = Q[1][0]_Q[2][1] - Q[2][0]_Q[1][1]; Q[1][2] = Q[2][0]_Q[0][1] - Q[0][0]_Q[2][1]; Q[2][2] = Q[0][0]_Q[1][1] - Q[1][0]_Q[0][1];

endif

return 0; }

//This is the code that calculate the eigenvalues

//

// Numerical diagonalization of 3x3 matrcies // Copyright (C) 2006 Joachim Kopp

//

// This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

//

include

include

//#include "dsyevc3.h"

// Constants

define M_SQRT3 1.73205080756887729352744634151 // sqrt(3)

// Macros

define SQR(x) ((x)*(x)) // x^2

//

int dsyevc3(float A[3][3], float w[3])

//

// Calculates the eigenvalues of a symmetric 3x3 matrix A using Cardano's // analytical algorithm. // Only the diagonal and upper triangular parts of A are accessed. The access // is read-only.

//

// Parameters: // A: The symmetric input matrix // w: Storage buffer for eigenvalues

//

// Return value: // 0: Success // -1: Error

//

{ float m, c1, c0;

// Determine coefficients of characteristic poynomial. We write // | a d f | // A = | d* b e | // | f* e* c | float de = A[0][1] * A[1][2]; // d * e float dd = SQR(A[0][1]); // d^2 float ee = SQR(A[1][2]); // e^2 float ff = SQR(A[0][2]); // f^2 m = A[0][0] + A[1][1] + A[2][2]; c1 = (A[0][0]_A[1][1] + A[0][0]_A[2][2] + A[1][1]_A[2][2]) // a_b + a_c + b_c - d^2 - e^2 - f^2

(dd + ee + ff); c0 = A[2][2]_dd + A[0][0]_ee + A[1][1]_ff - A[0][0]_A[1][1] _A[2][2] - 2.0 * A[0][2]_de; // c_d^2 + a_e^2 + b_f^2 - a_b_c - 2_f_d_e)

float p, sqrt_p, q, c, s, phi; p = SQR(m) - 3.0 c1; q = m(p - (3.0/2.0)_c1) - (27.0/2.0)_c0; sqrt_p = sqrt(fabs(p));

phi = 27.0 * ( 0.25SQR(c1)(p - c1) + c0_(q + 27.0/4.0_c0)); phi = (1.0/3.0) * atan2(sqrt(fabs(phi)), q);

c = sqrt_p _cos(phi); s = (1.0/M_SQRT3)_sqrt_p*sin(phi);

w[1] = (1.0/3.0)*(m - c); w[2] = w[1] + s; w[0] = w[1] + c; w[1] -= s;

return 0; }

— Reply to this email directly or view it on GitHub https://github.com/halide/Halide/issues/1026#issuecomment-188904265.

One of those sentences was pretty convoluted. To be clear: If I were writing this algorithm, I think I would produce faster code doing it in Halide than doing it by hand with intrinsics or assembly.

halide / Halide

One of the stages of the pipeline can not be expressed as a pure function. #1026

define ROTATE(a,i,j,k,l) g=a[i][j];h=a[k][l];a[i][j]=g-s_(h+g_tau);\

undef ROTATE

include

define SQR(x) ((x)*(x)) // x^2

ifndef EVALS_ONLY

endif

ifndef EVALS_ONLY

endif

include

include

define M_SQRT3 1.73205080756887729352744634151 // sqrt(3)

define SQR(x) ((x)*(x)) // x^2

include

define SQR(x) ((x)*(x)) // x^2

//

//

//

//

//

//

ifndef EVALS_ONLY

endif

ifndef EVALS_ONLY

endif

//

//

//

include

include

define M_SQRT3 1.73205080756887729352744634151 // sqrt(3)

define SQR(x) ((x)*(x)) // x^2

//

//

//

//

//