kumasento opened 3 years ago
I've got a theory: the performance difference could be caused by the final clang invocation that produces the executable.
The Polymer version uses `-O3`:

while the Pluto version doesn't (because we just want to do `-O3` once, and we use `-O3` to generate the LLVM-IR):
If I remove that `-O3` from the Polymer version (as in commit 990774f), the Polymer run time becomes 185.876198, which is pretty close to what we get from Pluto.
Interestingly, if we go the other way around and add `-O3` to the last clang compile in the Pluto version (as in the paper version, which uses double `-O3`), the Pluto version becomes faster at 157.360520, though still slower than the Polymer version with the final `-O3` (around 134 s). Using `clang -O3` to compile `seidel-2d.pluto.c` end-to-end to the executable gives a similar run time: 158.388780.
To summarize, here are the different options when compiling the Polymer & Pluto outputs:
| Polymer | Pluto |
|---|---|
| `mlir-opt` + `clang -O3` / 134.6 | `clang -O3 -emit-llvm` + `clang` / 186.5 |
| `mlir-opt` + `clang -O3` / 134.6 | `clang -O3 -emit-llvm` + `clang -O3` / 157.4 |
| `mlir-opt` + `clang` / 185.9 | `clang -O3 -emit-llvm` + `clang` / 186.5 |
| `mlir-opt` + `clang -O3` / 134.6 | `clang -emit-llvm` + `clang -O3` / 218.7 |
| `mlir-opt` + `clang -O3` / 134.6 | `clang -O3` / 158.4 |
Each of these options has its own reasoning; it's just that the 3rd one gives the smallest performance gap.
Additional questions:
This feels similar to our earlier analysis. What happens if you do O1 or O2? Definitely some post-MLIR optimizations are desired (to, say, inline and/or give memrefs a more reasonable representation). That said, I wonder if some of the affine LICM or other optimizations we do in advance make an impact, and a subsequent O3 does even more optimization.
> What happens if you do O1 or O2
I suppose you meant doing O1 or O2 when emitting LLVM for Pluto? That won't change the performance much, AFAIK.
> That said I wonder if some of the affine LICM or other optimizations we do in advance make an impact and a subsequent O3 does even more optimization.
The tricky part is that the final MLIR code is just a perfectly nested loop, and it cannot be as efficient as what Pluto produces after `-O3 -emit-llvm` (see the boundary calculation part above). So I suppose there could be some significant changes in the final `clang -O3` step that make things very different 😅
Note that Polymer's loop body is not inlined, so we should look at the body of `S0` instead.
I think MLIR's inliner is now capable of handling this, it should be just a matter of calling it.
For this case, we manually changed the Pluto-generated C code by updating the loop IV type from `int` to `long long`.
I don't expect this to be a big change. The relevant values are likely still in the same register (`%eax` vs. `%rax`). It may be necessary to zero it out at the beginning to implement `zext`, but it shouldn't be a huge difference.
> Pluto generated LLVM-IR adds `nsw` and/or `nuw` to those address calculation related binary operators, e.g.,
It's clang that adds them, I suppose, based on some modeling of integers in C/C++. Signed integer overflow/underflow is undefined behavior, and the standard allows the compiler to assume programs don't have undefined behavior, which implies `nuw` and `nsw` pretty much automatically for any operations that were originally performed on signed integers.
> You may also notice that in Polymer you can find `mul` by 64, while in Pluto these are optimised into `shl` by 6.
Again, I suppose clang may be implementing a strength-reduction optimization at the language level. I think we should rather look at the fully optimized IR, or even the assembly, because that's what is ultimately executed. Looking at unoptimized code will not give us much information unless we know precisely how it is going to be optimized. I would suggest starting with the optimized IR, finding the differences there, and trying to work back to the unoptimized IR.
> If I remove that `-O3` from the Polymer version (as in commit 990774f), the Polymer run time becomes 185.876198, which is pretty close to what we get from Pluto.
I don't think we should care much about unoptimized code. In particular, the memref descriptor may be quite expensive without SROA + constant folding and inst-combine. It is weird, though, that calling `opt` manually leads to apparently more optimizations on the LLVM IR than within clang.
Is there any way to investigate the difference between what the last `clang -O3` can do for Polymer's LLVM-IR and for Pluto's?
Stare at the optimized IR for both cases?
> I think MLIR's inliner is now capable of handling this, it should be just a matter of calling it.
Yup, I managed to do that in a recent update, by calling inline after lowering affine.
> It is weird though that calling `opt` manually leads to apparently more optimizations on the LLVM IR than within clang.
Would you mind elaborating on this? I might have described the results in a confusing way. I was trying to say that if you use `clang -O3 -emit-llvm` + `opt -O3` + `clang` (without `-O3`), the performance is lower than calling `clang -O3 -emit-llvm` + `clang -O3` itself.
> Stare at the optimized IR for both cases?
I suppose so. Will do that first thing today.
So I've got the assembly here (by simply adding `-save-temps` to the final `clang -O3`):
Looking at the loop body -
Pluto:
Polymer:
It seems that the address calculation is different. Would this affect the performance?
Specifically, are these two -
equivalent?
OK, I think I'm closer to the final answer. If we do the following when compiling the Pluto version of seidel-2d:

- Change the type of the IVs and bounds to `long long`;
- Apply `-O3` to both the LLVM IR emission and the assembly code generation, i.e., `clang -O3 -S -emit-llvm` + `clang -O3`.

To reproduce:
`./eval-perf -f ./seidel-2d-debug/seidel-2d/seidel-2d.c -t i64`
We can get 136.785956, which is very close to the Polymer result (134.634325 in the first post, 135.248473 in my latest run).
Recall that the result from using the same compilation options but with the `int` type is ~157.4 (sorry for the difference in significant figures); the performance gain is significant. This makes sense if we look at the assembly:
which is more compact than:
Extremely minor, what happens if you do unsigned long long?
On Mon, Dec 21, 2020 at 12:35 PM Ruizhe Zhao notifications@github.com wrote:
A summary for the run time of different Pluto configurations
| Source code | Emit LLVM | Compile to Executable (`llc`) | Run time |
|---|---|---|---|
| Default | `clang -O3 -S -emit-llvm` | `clang -O3` | 157.4 |
| Default | `clang -S -emit-llvm` | `clang -O3` | 218.7 |
| Default | `clang -O3 -S -emit-llvm` | `clang` | 186.5 |
| `i64` | `clang -O3 -S -emit-llvm` | `clang -O3` | 136.8 |
> Extremely minor, what happens if you do unsigned long long?
Good question! The result will be wrong, because some of the loop bounds should evaluate to negative values.
For example:

Here `t2` can be 0, which means the result of `ceild(64*t2-_PB_N-28,32)` could be negative.
What if you did the sort of loop reversal, to ensure the signedness of the induction variable, that we did in the frontend? This is sufficiently close (136 vs. 137) that I'd be happy, but I wonder if the signedness of the adds etc. matters here in the marginal additional bits.
I do absolutely believe the width of the index would make this difference, though.
Interesting.
> What if you did the sort of loop reversal to ensure signed we did in the frontend?
Is there possibly any option I could just use in `clang` to do that? Also, I'm not sure what "ensure signed" actually means. Would you mind giving a bit more info on this?
Revised the comment above to be more descriptive. Basically, we rewrite loops with negative steps here (https://github.com/wsmoses/MLIR-GPU/blob/eddfef09a9078c3c7677a105ae016cafb2fd8559/mlir/tools/mlir-clang/Lib/clang-mlir.cc#L363) to have a positive step from 0 to the number of iterations, then do the math previously linked to get the actual iteration value.
This ensures that the actual for loop induction variable is unsigned, which perhaps matters here? I'm assuming that the place where there's a signed induction variable only happens with a negative step (correct me if I'm wrong).
I see, thanks for these details. Just to confirm: do you mean doing a transformation like this, from

```c
for (i = -n; i < m; i ++)
  A[i] = B[i] + C[i];
```

to

```c
for (i = 0; i < n + m; i ++)
  A[i-n] = B[i-n] + C[i-n];
```
> This ensures that the actual for loop induction variable is unsigned, which perhaps matters here?
I'm not sure, especially since the MLIR code that Polymer produces may contain negative boundaries as well, which could finally result in code very similar to what Pluto produces.
Yeah that's precisely the transformation that happens.
I see. Then I'm a bit skeptical that applying this transformation can further narrow the perf gap. As I mentioned above, the MLIR code produced by Polymer doesn't have that transformation applied either. Maybe the transformation can improve the perf of both, but it is not likely that it would bring them closer 😅
Oh, apologies, I misread the transformation a bit. The transformation only runs for scenarios with a negative step (not applied here).
So (and not checking off-by-ones here):

```c
for (i = B; i >= E; i--)
  A[i] = B[i] + C[i];
```

becomes

```c
for (i = 0; i < B-E; i ++) {
  idx = B-i;
  A[idx] = B[idx] + C[idx];
}
```
It's actually a bit more messy in a way that I'm trying to clean up now.
Positive step will remain the same.
UPDATE: There is a tentative conclusion on this matter. To make sure that the Pluto version achieves the same performance as the Polymer version, it should use `i64` for its induction variables and boundaries (which removes loads of `sext` and `trunc`), and `-O3` should be applied for both LLVM-IR emission and final executable compilation (previously we just used `-O3` for emitting the LLVM IR). Numbers can be found at: https://github.com/kumasento/polymer/issues/71#issuecomment-749104438
This issue is an informal report on the whole investigation procedure.
Setup
To reproduce:

- `seidel-2d-debug` contains the EXTRALARGE version of the seidel-2d benchmark code.
- Running `eval-perf` compiles and runs the given code with both Pluto and Polymer; it outputs the overall run time and keeps the important intermediate results.
- The Pluto CLI (`polycc`) and Polymer use the same options, i.e., no parallelization, vectorization, or unroll-and-jam.
- Vectorization is disabled at the clang level using `-fno-vectorize` (you may investigate its effect by checking the LLVM IR code).
- Only one `-O3` is applied while generating the LLVM IR.
- The host machine doesn't have any of its multi-core configurations disabled; e.g., the hyper-threading config stays the same.
Notice the difference
After running the `eval-perf` script given above, we have:

The performance gap is huge.
Check the CLooG code
We would normally check the schedule first, but since their CLooG outputs are both pretty compact, we skipped the schedule check and looked directly at the generated CLooG code.
From Polymer
https://github.com/kumasento/polymer/blob/a086115385a14af71aa928faa29f4225457adea9/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.polymer.cloog#L1-L15
From Pluto
https://github.com/kumasento/polymer/blob/a086115385a14af71aa928faa29f4225457adea9/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.c#L84-L98
I won't say I was very focused when comparing these two, but as far as I can tell, they are identical regarding the nested loops.
Check the LLVM IR code
Now we move into the dark domain.
We check two different things: the complicated loop boundary calculation and the loop body.
Loop body
First of all, without additional flags, the Pluto result uses vectorization (i.e., `-fno-vectorize` is not applied), and its run time is 200.756374, about 7.6% slower than the version without vectorization (shown above).

Now let's have a closer look at the loop body:
Polymer:
https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.polymer.ll#L263-L343
Pluto:
https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.ll#L301-L340
Note that Polymer's loop body is not inlined, so we should look at the body of `S0` instead.

Address Calculation
The major difference between the two is the usage of `getelementptr`. Polymer explicitly calculates the flattened 1-d address, e.g.,

https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.polymer.ll#L274-L276
while Pluto delegates the task to getelementptr:
https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.ll#L307
Since `getelementptr` in both cases should provide the same address, the difference shouldn't be significant (not empirically verified).

Also, since the loop IVs in Pluto are `i32` typed, it has to explicitly call `sext` to extend them to `i64` before passing them to `getelementptr`. For this case, we manually changed the Pluto-generated C code by updating the loop IV type from `int` to `long long`:

https://github.com/kumasento/polymer/blob/857a9a5fb7ce4f446175f4b6b704170ce6fca261/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.i64.c#L82
and the `sext`s are eliminated (as well as those `trunc`s):

https://github.com/kumasento/polymer/blob/857a9a5fb7ce4f446175f4b6b704170ce6fca261/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.i64.ll#L261-L262
However, the overall performance drops from around 187 to 195.4, which may suggest that using `int` is not the main reason for the gap between Polymer and Pluto.

Finally, Pluto adds `inbounds` while Polymer doesn't. In theory, this shouldn't affect the performance much. We have also manually removed `inbounds` and run again, and the result is roughly the same.

Computation Sequence
Pluto and Polymer compute the seidel-2d kernel in the same order, from top-left to bottom-right.
Conclusion
The loop body may not be the major source of the performance gap.
Boundary calculation
Boundary calculation is a bit harder to look into. So far, there are two differences I found that might affect the performance.
`nsw` and `nuw`

Pluto-generated LLVM-IR adds `nsw` and/or `nuw` to those address-calculation-related binary operators, e.g.,

https://github.com/kumasento/polymer/blob/857a9a5fb7ce4f446175f4b6b704170ce6fca261/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.ll#L244-L252
It is unknown whether this could have a significant effect on the performance. Manually removing them doesn't produce any better result.
Direct translation vs optimized
(Sorry if I use the wrong term)
There are plenty of `floordiv` and `ceildiv` operations in the loop bounds. The Polymer version has them directly compiled to the corresponding LLVM IR operators, e.g., `sdiv`, `mul`, etc., while Pluto extensively optimizes them by constant folding or other strategies.

Take the outermost loop as an example:
https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.c#L85
In Polymer,
https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.polymer.ll#L362-L376
You can find that `%32` is `floord(_PB_STEPS, 32) + 1`, which is calculated mainly by `sdiv`, and `%34` is the loop IV `t1`.

In Pluto:
https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.ll#L375-L385
The condition, `%exitcond124.not.i = icmp eq i64 %indvars.iv.next93.i, 32`, basically compares `t1 + 1` (`%indvars.iv.next93.i`) with `32`. `32` is the constant-folded value of `floord(_PB_STEPS, 32) + 1`, given `_PB_STEPS=1000`.
You may also notice that in Polymer you can find `mul` by `64`, while in Pluto these are optimised into `shl` by `6`.
Conclusion
We still cannot reach a solid conclusion on which part dominates the performance gap. In particular, Pluto's version seems to be more optimized than the Polymer version regarding address calculation, so it is very likely that the difference in the unoptimized LLVM-IR is not the major cause of the performance difference.