LebedevRI opened this issue 5 years ago
Yes, coremark needs loop flatten and jump threading for FSM (https://reviews.llvm.org/D99205). Then the perf should match gcc.
LoopFlatten - what are your plans with this pass? Are you going to enable it by default?
Yeah, I think we should get this enabled by default. While the implementation is quite focused on recognising a particular form, it does not only trigger in CoreMark; I have seen it triggering in other code bases too, which is why I think we should enable it. In fact, downstream we do have this enabled by default. I am aware of one issue, though, which we need to fix first. I will try to dig up that case this week and raise a ticket for it.
In CoreMark, gcc 11 is about 18% faster than clang 12. https://www.phoronix.com/scan.php?page=article&item=clang12-gcc11-icelake&num=4
LoopFlatten was committed October 2020. I have also added widening of the induction variable. However, LoopFlatten is not yet enabled by default (can be enabled on the command-line).
Besides this, there are probably other reasons why we are behind. Jump threading is likely another one; that is in review in https://reviews.llvm.org/D99205. Loop unroll-and-jam might also help, and is also not enabled by default.
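For illustration, here is a hedged sketch of the switch-in-a-loop FSM shape that jump threading targets (modelled loosely on CoreMark's `core_state`; the function and names below are hypothetical, not from the patch). After threading, the back-edge branch from each `case` can jump directly to the likely successor state's code instead of re-dispatching through the `switch` on every iteration:

```c
#include <assert.h>

/* Hypothetical two-state scanner: counts runs of digits in a string.
 * The state variable 'st' is set in one iteration and immediately
 * switched on in the next, which is exactly the pattern jump
 * threading can short-circuit. */
static int count_numbers(const char *s) {
    enum { OUT, IN } st = OUT;  /* OUT: between tokens, IN: inside a number */
    int count = 0;
    for (; *s; ++s) {
        switch (st) {
        case OUT:
            if (*s >= '0' && *s <= '9') { st = IN; ++count; }
            break;
        case IN:
            if (!(*s >= '0' && *s <= '9')) st = OUT;
            break;
        }
    }
    return count;
}
```

With the state known at the end of each case, a threaded version effectively becomes a set of direct gotos between state blocks, removing the repeated dispatch branch.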
Yeah, it looks like that case should be OK from the A[i*N+j] use of an inbounds gep.
The widening sounds interesting, but on the architecture I work on at the moment, i64 induction variables would be much slower than i32s, costing more than would be gained by flattening the loop unless it started to vectorize a lot better. So it's not something that the pass can do on its own, but if it happens naturally for the architecture it should just work, provided you can get past all the phase-ordering issues.
https://github.com/eembc/coremark/blob/master/core_matrix.c#L242
could these CoreMark loops be improved, then?
And instead of overflow check, maybe just use wider induction variable (if possible)?
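To sketch what the wider-induction-variable idea might look like in source form (this is an illustrative hand-written version, not output of the pass; `copy_widened` is a made-up name): if the flattened trip count is computed in 64 bits, `n * n` cannot overflow for any 32-bit `n`, so no runtime overflow check is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Flattened copy with a 64-bit induction variable. The cast before
 * the multiply is what makes (int64_t)n * n overflow-free for any
 * non-negative 32-bit n. */
void copy_widened(int n, int *A, const int *B) {
    int64_t total = (int64_t)n * n;  /* cannot overflow in 64 bits */
    for (int64_t k = 0; k < total; ++k)
        A[k] = B[k];
}
```

As noted above, on targets where 64-bit induction variables are expensive this trades the overflow check for slower index arithmetic, so it is not a universal win.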
https://reviews.llvm.org/D42365. Written by Oliver S. I'm not sure it will handle case 1 without some help.
I never pushed it very hard, as it very rarely comes up in the benchmarks I ran. I think GCC used to have a similar pass that they removed (I may be wrong; I can't find anything about it online now). And Polly has/had a flattening schedule used for testing. Perhaps it doesn't need to be its own pass, and it's something we could teach loop-simplify-cfg to do? Not sure.
In your case, we would have to prove that n * n doesn't overflow. Alternatively, you can version the loop with a runtime check, but that doesn't sound very worthwhile to me.
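For concreteness, a hand-written sketch of what that versioning would look like (illustrative only; `copy_versioned` is a made-up name, and a real pass would emit this at the IR level): a runtime check guards the flattened fast path, with the original nest as the fallback.

```c
#include <assert.h>
#include <limits.h>

/* Versioned copy: take the flattened loop only when the runtime
 * check proves n * n fits in an int; otherwise run the original
 * nest with 64-bit index arithmetic. */
void copy_versioned(int n, int *A, const int *B) {
    if (n > 0 && (long long)n * n <= INT_MAX) {
        /* Fast path: flattened single loop, no overflow possible. */
        for (int k = 0, end = n * n; k < end; ++k)
            A[k] = B[k];
    } else {
        /* Fallback: original nest, widened index to stay defined. */
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                A[(long long)i * n + j] = B[(long long)i * n + j];
    }
}
```

The cost is code-size growth and an extra compare-and-branch on entry, which is part of why versioning rarely looks worthwhile for a loop this small.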
Extended Description
https://godbolt.org/z/B3VM2A
```c
void v0(int n, int *A, int *B) {
  int k = 0;
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
      A[k] = B[k];
      k++;
    }
}
```
is equivalent to
```c
void v1(int n, int *A, int *B) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
      int k = i * n + j;
      A[k] = B[k];
    }
}
```
(Also, which one of these ^ is better? clang does not optimize them into the same IR)
Shouldn't that be transformed into
```c
void v2(int n, int *A, int *B) {
  for (int k = 0; k < n * n; k++)
    A[k] = B[k];
}
```
?