kumasento opened 3 years ago
I've got a theory: the performance difference could be caused by the final clang invocation that produces the executable.
The Polymer version uses `-O3`:

while the Pluto version doesn't (because we just want to do `-O3` once, and we use `-O3` to generate the LLVM-IR):
If I remove that `-O3` from the Polymer version (as in commit 990774f), the Polymer run time becomes 185.876198, which is pretty close to what we get from Pluto.
Interestingly, if we go the other way around and add `-O3` to the last clang compile in the Pluto version (as in the paper version, which uses double `-O3`), the Pluto version becomes faster at 157.360520, though still slower than the Polymer version with the final `-O3` (around 134 s). Using `clang -O3` to compile `seidel-2d.pluto.c` end-to-end to the executable gives a similar run time: 158.388780.
To summarize, here are the different options when compiling the Polymer & Pluto outputs:
| Polymer | Pluto |
|---|---|
| `mlir-opt` + `clang -O3` / 134.6 | `clang -O3 -emit-llvm` + `clang` / 186.5 |
| `mlir-opt` + `clang -O3` / 134.6 | `clang -O3 -emit-llvm` + `clang -O3` / 157.4 |
| `mlir-opt` + `clang` / 185.9 | `clang -O3 -emit-llvm` + `clang` / 186.5 |
| `mlir-opt` + `clang -O3` / 134.6 | `clang -emit-llvm` + `clang -O3` / 218.7 |
| `mlir-opt` + `clang -O3` / 134.6 | `clang -O3` / 158.4 |
Each of these options has its own reasoning; it's just that the 3rd one gives the smallest performance gap.
Additional questions:
This feels similar to our earlier analysis. What happens if you do O1 or O2? Definitely some post-MLIR optimizations are desired (to, say, inline and/or give memrefs a more reasonable representation). That said, I wonder if some of the affine LICM or other optimizations we do in advance make an impact, and a subsequent O3 does even more optimization.
> What happens if you do O1 or O2
I suppose you meant doing O1 or O2 when emitting LLVM for Pluto? That won't change the performance much, AFAIK.
> That said I wonder if some of the affine LICM or other optimizations we do in advance make an impact and a subsequent O3 does even more optimization.
The tricky part is that the final MLIR code is just a perfectly nested loop, and it cannot be as efficient as what Pluto produces after `-O3 -emit-llvm` (see the boundary calculation part above). So I suppose there could be some significant changes in the final `clang -O3` step that make things very different 😅
Note that Polymer's loop body is not inlined, so we should look at the body of `S0` instead.
I think MLIR's inliner is now capable of handling this, it should be just a matter of calling it.
For this case, we manually changed the Pluto-generated C code by updating the loop IV type from `int` to `long long`.
I don't expect this to be a big change. The relevant values are likely still in the same register (`%eax` vs. `%rax`). It may be necessary to zero it out at the beginning to implement `zext`, but it shouldn't be a huge difference.
> Pluto generated LLVM-IR adds `nsw` and/or `nuw` to those address calculation related binary operators, e.g.,
It's clang that adds them, I suppose, based on some modeling of integers in C/C++. Signed integer overflow/underflow is undefined behavior, and the standard allows the compiler to assume programs don't have undefined behavior, which implies `nuw` and `nsw` pretty much automatically for any operations that were originally performed on signed integers.
> You may also notice that in Polymer you can find `mul` by 64, while in Pluto these are optimised into `shl` by 6.
Again, I suppose clang may be implementing a strength-reduction optimization at the language level. I think we should rather look at the fully optimized IR, or even the assembly, because that's what is ultimately executed. Looking at unoptimized code will not give us much information unless we know precisely how it is going to be optimized. I would suggest starting with the optimized IR, finding the differences there, and trying to work back to the unoptimized IR.
> If I remove that `-O3` from the Polymer version (as in commit 990774f), the Polymer run time becomes 185.876198, which is pretty close to what we get from Pluto.
I don't think we should care much about unoptimized code. In particular, the memref descriptor may be quite expensive without SROA + constant folding and inst-combine. It is weird, though, that calling `opt` manually leads to apparently more optimizations on the LLVM IR than within clang.
Is there any way to investigate the difference between what the last `clang -O3` can do for Polymer's LLVM-IR and for Pluto's?
Stare at the optimized IR for both cases?
> I think MLIR's inliner is now capable of handling this, it should be just a matter of calling it.
Yup, I managed to do that in a recent update, by calling inline after lowering affine.
> It is weird though that calling `opt` manually leads to apparently more optimizations on the LLVM IR than within clang.
Would you mind elaborating on this? I might have described the results in a confusing way. I was trying to say that if you use `clang -O3 -emit-llvm` + `opt -O3` + `clang` (without `-O3`), the performance is lower than calling `clang -O3 -emit-llvm` + `clang -O3` itself.
> Stare at the optimized IR for both cases?
I suppose so. Will do that first thing today.
So I've got the assembly here (by simply adding `-save-temps` to the final `clang -O3`):
Looking at the loop body -
Pluto:
Polymer:
It seems that the address calculation is different. Would this affect the performance?
Specifically, are these two -
equivalent?
OK, I think I'm closer to the final answer. If we do the following when compiling the Pluto version of seidel-2d:

- Change the type of the IVs and bounds to `long long`;
- Apply `-O3` to both the LLVM IR emission and the assembly code generation, i.e., `clang -O3 -S -emit-llvm` + `clang -O3`.

To reproduce:
`./eval-perf -f ./seidel-2d-debug/seidel-2d/seidel-2d.c -t i64`
We can get 136.785956, which is very close to the Polymer result (134.634325 in the first post, 135.248473 in my latest run).
Recall that the result from using the same compilation options but with the `int` type is ~157.4 (sorry for the difference in significant figures); the performance gain is significant. This makes sense if we look at the assembly:
which is more compact than:
Extremely minor, what happens if you do unsigned long long?
On Mon, Dec 21, 2020 at 12:35 PM Ruizhe Zhao notifications@github.com wrote:
A summary for the run time of different Pluto configurations
| Source code | Emit LLVM | Compile to Executable (`llc`) | Run time |
|---|---|---|---|
| Default | `clang -O3 -S -emit-llvm` | `clang -O3` | 157.4 |
| Default | `clang -S -emit-llvm` | `clang -O3` | 218.7 |
| Default | `clang -O3 -S -emit-llvm` | `clang` | 186.5 |
| `i64` | `clang -O3 -S -emit-llvm` | `clang -O3` | 136.8 |
> Extremely minor, what happens if you do unsigned long long?
Good question! The result will be wrong, because some of the loop bounds should evaluate to negative values.
For example:

Here `t2` can be 0, which means the result of `ceild(64*t2-_PB_N-28,32)` could be negative.
What if you did the sort of loop reversal, to ensure the signedness of the induction variable, that we did in the frontend? This is sufficiently close (136 vs. 137) that I'd be happy, but I wonder if the signedness of the adds etc. matters here in the marginal additional bits.
I do absolutely believe the width of the index would make this difference, though.
Interesting.
> What if you did the sort of loop reversal to ensure signed we did in the frontend?
Is there possibly any option I could just use in `clang` to do that? Also, I'm not sure what "ensure signed" actually means. Would you mind giving a bit more info on this?
Revised the comment above to be more descriptive. Basically, we rewrite loops with negative steps here (https://github.com/wsmoses/MLIR-GPU/blob/eddfef09a9078c3c7677a105ae016cafb2fd8559/mlir/tools/mlir-clang/Lib/clang-mlir.cc#L363) to have a positive step from 0 to the number of iterations, then do the math previously linked to get the actual iteration value.
This ensures that the actual for loop induction variable is unsigned, which perhaps matters here? I'm assuming that the place where there's a signed induction variable only happens with a negative step (correct me if I'm wrong).
I see, thanks for these details. Just to confirm: do you mean doing a transformation like this, from

```c
for (i = -n; i < m; i ++)
  A[i] = B[i] + C[i];
```

to

```c
for (i = 0; i < n + m; i ++)
  A[i-n] = B[i-n] + C[i-n];
```
> This ensures that the actual for loop induction variable is unsigned, which perhaps matters here?
I'm not sure, especially since the MLIR code that Polymer produces may contain negative boundaries as well, which could finally result in code very similar to what Pluto produces.
Yeah that's precisely the transformation that happens.
I see. Then I'm a bit skeptical that applying this transformation can further narrow the perf gap. As I mentioned above, the MLIR code produced by Polymer doesn't have that transformation applied either. Maybe the transformation can improve the perf of both, but it is not likely that it would bring them closer 😅
Oh, apologies, I misread the transformation a bit. The transformation only runs for scenarios with a negative step (not applied here).
So (and not checking off-by-ones here):

```c
for (i = B; i >= E; i--)
  A[i] = B[i] + C[i];
```

becomes

```c
for (i = 0; i < B-E; i ++) {
  idx = B-i;
  A[idx] = B[idx] + C[idx];
}
```
It's actually a bit more messy in a way that I'm trying to clean up now.
Positive step will remain the same.
UPDATE: There is a tentative conclusion on this matter. To make sure that the Pluto version achieves the same performance as the Polymer version, it should use `i64` for its induction variables and boundaries (which removes loads of `sext` and `trunc`), and `-O3` should be applied for both LLVM-IR emission and final executable compilation (previously we just used `-O3` for emitting the LLVM IR). Numbers can be found at: https://github.com/kumasento/polymer/issues/71#issuecomment-749104438
This issue is an informal report on the whole investigation procedure.
Setup
To reproduce:

- `seidel-2d-debug` contains the EXTRALARGE version of the seidel-2d benchmark code.
- Running `eval-perf` compiles and runs the given code with both Pluto and Polymer; it outputs the overall run time and keeps the important intermediate results.
- The Pluto CLI (`polycc`) and Polymer use the same options, i.e., no parallelization, vectorization, or unroll-and-jam.
- Vectorization is disabled at the clang level using `-fno-vectorize` (you may investigate its effect by checking the LLVM IR code).
- Only one `-O3` is applied while generating the LLVM IR.
- The host machine doesn't have any of its multi-core configurations disabled; e.g., the hyper-threading config stays the same.
Notice the difference
After running the `eval-perf` script given above, we have:

The performance gap is huge.
Check the CLooG code
We would normally check the schedule first, but since their CLooG outputs are both pretty compact, we skipped the schedule check and looked directly at the generated CLooG code.
From Polymer
https://github.com/kumasento/polymer/blob/a086115385a14af71aa928faa29f4225457adea9/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.polymer.cloog#L1-L15
From Pluto
https://github.com/kumasento/polymer/blob/a086115385a14af71aa928faa29f4225457adea9/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.c#L84-L98
I won't say I was very focused when comparing these two, but as far as I can tell, they are identical regarding the nested loops.
Check the LLVM IR code
Now we move into the dark domain.
We check two different things: the complicated loop boundary calculation and the loop body.
Loop body
First of all, without additional flags, the Pluto result uses vectorization (i.e., `-fno-vectorize` is not applied), and its run time is 200.756374, about 7.6% slower than the version without vectorization (shown above).

Now let's have a closer look at the loop body:
Polymer:
https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.polymer.ll#L263-L343
Pluto:
https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.ll#L301-L340
Note that Polymer's loop body is not inlined, so we should look at the body of `S0` instead.

Address Calculation
The major difference between the two is the usage of `getelementptr`. Polymer explicitly calculates the flattened 1-d address, e.g.,

https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.polymer.ll#L274-L276
while Pluto delegates the task to getelementptr:
https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.ll#L307
Since `getelementptr` in both cases should provide the same address, the difference shouldn't be significant (not empirically verified).

Also, since the loop IVs in Pluto are `i32` typed, it has to explicitly call `sext` to extend them to `i64` before passing them to `getelementptr`. For this case, we manually changed the Pluto-generated C code by updating the loop IV type from `int` to `long long`:

https://github.com/kumasento/polymer/blob/857a9a5fb7ce4f446175f4b6b704170ce6fca261/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.i64.c#L82
and the `sext`s are eliminated (as well as those `trunc`s):

https://github.com/kumasento/polymer/blob/857a9a5fb7ce4f446175f4b6b704170ce6fca261/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.i64.ll#L261-L262
However, the overall performance drops from around 187 to 195.4, which may suggest that using `int` is not the main reason for the gap between Polymer and Pluto.

Finally, Pluto adds `inbounds` while Polymer doesn't. In theory, this shouldn't affect the performance much. We have also manually removed `inbounds` and run again, and the result is roughly the same.

Computation Sequence
Pluto and Polymer compute the seidel-2d kernel in the same order, from top-left to bottom-right.
Conclusion
The loop body may not be the major source of the performance gap.
Boundary calculation
Boundary calculation is a bit harder to look into. So far, there are two differences I found that might affect the performance.
`nsw` and `nuw`

Pluto-generated LLVM-IR adds `nsw` and/or `nuw` to those address-calculation-related binary operators, e.g.,

https://github.com/kumasento/polymer/blob/857a9a5fb7ce4f446175f4b6b704170ce6fca261/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.ll#L244-L252
It is unknown whether this could have a significant effect on the performance. Manually removing them doesn't produce any better result.
Direct translation vs optimized
(Sorry if I use the wrong term)
There are plenty of `floordiv` and `ceildiv` operations in the loop bounds. The Polymer version has them directly compiled to the corresponding LLVM IR operators, e.g., `sdiv`, `mul`, etc., while Pluto extensively optimizes them by constant folding or other strategies.

Take the outermost loop as an example:
https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.c#L85
In Polymer,
https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.polymer.ll#L362-L376
You can find that `%32` is `floord(_PB_STEPS, 32) + 1`, which is calculated mainly by `sdiv`, and `%34` is the loop IV `t1`.

In Pluto:
https://github.com/kumasento/polymer/blob/b08410c58f416537538d3ea1eb88b2a4597a7368/example/polybench/seidel-2d-debug/seidel-2d/seidel-2d.pluto.ll#L375-L385
The condition, `%exitcond124.not.i = icmp eq i64 %indvars.iv.next93.i, 32`, basically compares `t1 + 1` (`%indvars.iv.next93.i`) with `32`. `32` is the constant-folded value of `floord(_PB_STEPS, 32) + 1`, given `_PB_STEPS=1000`.
You may also notice that in Polymer you can find `mul` by `64`, while in Pluto these are optimised into `shl` by `6`.
Conclusion
We still cannot reach a solid conclusion on which part dominates the performance gap. In particular, Pluto's version seems to be more optimized than the Polymer version regarding address calculation, so it is very likely that the difference in the unoptimized LLVM-IR is not the major cause of the performance difference.