Open llvmbot opened 8 years ago
Thank you Michael, this was really helpful. I changed the SCEV analysis accordingly and it indeed solved this problem for trmm
and the provided test case. I am currently also experimenting with other Julia programs and there are other situations as well where the lowering of for
loops to LLVM produces code that cannot be handled by SCEV. I will wait with a patch until I know if further changes to SCEV are required.
Best, Matthias
From kernel_trmm.ll :
%30 = add i64 %"#temp#1.030", 1 %36 = icmp sgt i64 %30, %12 %37 = select i1 %36, i64 %"#temp#1.030", i64 %12
ie: %37 = ((%"#temp#1.030" + 1) > %12) ? "#temp#1.030" : %12
This is not detected as an smax expression in ScalarEvolution::createNodeForSelectOrPHI because what is compared in icmp is not the same as in the select (%30 vs. %"#temp#1.030").
This equivalent code is detected as expected:
%36 = icmp sge i64 %"#temp#1.030", %12 %37 = select i1 %36, i64 %"#temp#1.030", i64 %12
ie: %37 = (%"#temp#1.030" >= %12) ? "#temp#1.030" : %12
I see two possible solutions:
1.) Make the Julia compiler generate the second variant.
2.) Teach ScalarEvolution::createNodeForSelectOrPHI to detect this pattern. Something like this:
case ICmpInst::ICMP_SGT: if (LDiff == RDiff || getAddRec(LDiff, 1) == RDiff) return getAddExpr(getSMaxExpr(LS, RS), LDiff);
(Take care for the case when %"#temp#1.030" has its maximum value)
This would be a change to LLVM itself and therefore should be reviewed by LLVM people as well.
I've attached another test case now that shows that Polly is able to handle select
instructions in simpler situations, namely, when select
is used to model an smax
or umax
expression. However, Polly relies on the SCEV Analysis pass to detect this particular usage of select
. The first test case, that I attached earlier, fails only because the SCEV Analysis fails to detect that select
is used to compute the maximum of two values.
That means, at the moment, Polly does not handle select
s explicitly. The question that now arises is whether it is possible to model such instructions effectively. I assume in general they could be modeled like regular IR branches, but maybe that might not be very efficient? Another possibility would be to generalize the SCEV Analysis to detect more smax
/umax
patterns, however I am not very familiar with the internals of SCEV, so I cannot really estimate if that would be reasonable. And another possibility would be to change Julia's code generator to directly use smax
instructions here instead of select
which would have to be discussed on the Julia mailing list.
Is there a preferred way how this could be handled? I'm looking forward to hear your opinion on this.
Best, Matthias
trmm
, Reduced test case, Passing test caseExtended Description
I ported PolyBench's
trmm
benchmark to Julia to test Polly on the LLVM IR that Julia produces when compiling it. The according Julia function that I used for this purpose looks as follows:@polly function kernel_trmm(alpha, A, B) m,n = size(B) for i = 1:m, j = 1:n for k = (i+1):m B[i,j] += A[k,i] B[k,j] end B[i,j] = alpha B[i,j] end end
The LLVM code that Julia emits for this function (when disabling bounds-checks via
--check-bounds=no
) is attached askernel_trmm.ll
.I invoked Polly the following way on this LLVM code:
$ opt -debug-only=polly-detect -polly-codegen -S llvm/kernel_trmm.ll > /dev/null
This produces the following output:
Checking region: top =>
Top level region is invalid
Checking region: top.split => L10
Non affine loop bound '({-2,+,-1}<%if> + %37)' in loop: if12
Checking region: if => L.L10_crit_edge
Non affine loop bound '({-2,+,-1}<%if> + %37)' in loop: if12
Checking region: if => L.loopexit
Non affine loop bound '({-2,+,-1}<%if> + %37)' in loop: if12
Checking region: if11 => L4.L.loopexit_crit_edge
Non affine loop bound '({-2,+,-1}<%if> + %37)' in loop: if12
Checking region: if11 => L7
Non affine loop bound '({-2,+,-1}<%if> + %37)' in loop: if12
Checking region: if12 => L5.L7_crit_edge
OK
Expanding if12 => L5.L7_crit_edge
Trying if12 => L7
to if12 => L7
Region can not profitably be optimized!
For computing loop bounds Julia emits
select
instructions and it seems that they are the source of the problem. This issue has also been subject to discussion on the polly-dev mailing list: https://groups.google.com/forum/#!topic/polly-dev/P2RZUlKRo0I The original C version oftrmm
however, when compiled via clang, can be optimized by Polly without problems.I am currently working on a solution and hope to come up with new findings as soon as possible.