Challenges Implementing LU factorization in Halide's algorithm DSL with autosheduler

Rehanchy commented 2 months ago

Dear Halide Developers,

We are new to Halide and are exploring the implementation of various linear algebra algorithms using Halide DSL. Particularly, we want to leverage the autoscheduler and test Halide's performance.

But we've encountered some challenges when trying to implement LU factorization(with gaussian elimination) using Halide. It seems that we are not able to set variables as the loop boundaries like /(for int i = 0; i < colSize; i++)/, and it's also hard to iteratively update matrix data (for example, the update of matrix A will be row by row, the ith update will be based on matrix A's result after the i-1th update) , since LU factorization have three levels of nested loop and we didn't figure out a way to use RDom to construct this pattern of iterations.

We did some investigation to see whether there are others previously asking similar questions and didn't find many, we saw some answers like using extern C to construct the loop and call Halide functions to do the computation inside the loop, but we are not sure about that.

Thus, we wish to ask your opinions on this issue, are there known workarounds or strategies within Halide that might allow us to overcome these challenges? Or could this be seen as a fundamental limitation of the Halide DSL?

We would greatly appreciate any guidance or corrections you could provide, thanks a lot!

mcourteaux commented 2 months ago

I was tempted to try to write a Pipeline that does Gaussian Elimination. After some fiddling, I really think Halide is not designed for these type of algorithms. I attempted to make the outerloop of the row-reduction algorithm a pure variable, and then make an RDom that iterates over those iterations and copies the matrix over from the previous step, plus the row-operation required at that step.

Doing something simple like that, which is still a long way from a whole LU factorization, looks like this:

Buffer<float> matrix(10, 10);
Func rref{"rref"};

Expr num_rows = matrix.dim(1).extent();
Expr num_cols = matrix.dim(0).extent();
Var row{"row"}, col{"col"};
Var i{"iteration"};

// (1) Copy matrix over to the first iteration:
rref(col, row, i) = select(i == 0, matrix(col, row), undef<float>());

// (2) Iterate over the row-operations
RDom it = {1, num_rows - 1, "iteration"};
Expr scale = rref(col, col, it - 1) / rref(it, it, it - 1);
rref(col, row, it) = rref(col, row, it - 1) - select(row > it, scale * rref(col, it - 1, it - 1), 0.0f);

Buffer<float> out = rref.realize({10, 10, 10});

But gave me this error:

User error triggered at /home/martijn/3rd/halide/src/Function.cpp:229 Condition failed: var && var->name == pure_args[i] Error: In definition of Func "rref": All of a function's recursive references to itself must contain the same pure variables in the same places as on the left-hand-side.

So I fear that writing at least one outer-loop yourself in C and then calling the Halide pipeline to do the work inside that iteration will be necessary. Still, it will be difficult to work around the fact that a normal Gaussian Elimination algorithm works in-place. Halide won't like that, as you saw above. So you'd need two buffers to ping-pong between to get to the final result. Additionally, doing multiple update-definitions (dependent on an RDom) are semantically always executed after the previous one. So for example in the case of LU decomposition, interleaving the operations on matrix L and P is not possible, unless you break up the Pipeline and call it in loop. By that point, you're close to just having written the algorithm yourself in C either way. The only reason you might want to try Halide is to do vectorization within the row-operations, and potentially parallelization in case of huge matrices, as the row operations are independent.

Overall, while an interesting question, I don't think Halide is anywhere near the right tool for this job. I'm by no means a Halide expert, so perhaps I'm unaware of some tricks to get this to work.

abadams commented 2 months ago

Martijn's experience with these kinds of mostly-serial usually-in-place linear algebra operations matches mine. Sometimes you can make it work, but I don't think Halide is the right tool for it. If you really want Halide to get portable vectorized code, you can often just write the innermost loop in Halide and call it from C as Martijn describes. I have done this in the past in a few cases where Halide was already a dependency and I didn't have any other easy way to get a portably-vectorized inner loop. But if you can trust your C compiler to autovectorize, Halide doesn't bring much value. There's no scheduling space to explore.

Rehanchy commented 2 months ago

All of a function's recursive references to itself must contain the same pure variables in the same places as on the left-hand-side.

This is exactly one of the errors I constantly see when trying to implement the LU factorization in Halide DSL.

Thanks for the very detailed replies and suggestions Martijn and Andrew, that helps a lot!

halide / Halide

Challenges Implementing LU factorization in Halide's algorithm DSL with autosheduler #8398