LebedevRI opened this issue 5 years ago
Yes, coremark needs loop flatten and jump threading for FSM (https://reviews.llvm.org/D99205). Then the perf should match gcc.
LoopFlatten - what are your plans with this pass? Are you going to enable it by default?
Yeah, I think we should get this enabled by default. While the implementation is quite focused on recognising a particular form, it does not only trigger in CoreMark; I have seen it triggering in other code bases too, which is why I think we should enable it. In fact, downstream we do have this enabled by default. I am aware of one issue, though, which we need to fix first. I will try to dig up that case this week and raise a ticket for it.
In CoreMark, gcc 11 is about 18% faster than clang 12. https://www.phoronix.com/scan.php?page=article&item=clang12-gcc11-icelake&num=4
LoopFlatten was committed October 2020. I have also added widening of the induction variable. However, LoopFlatten is not yet enabled by default (can be enabled on the command-line).
Besides this, there are probably other reasons why we are behind. Jump threading is likely another one; that is in review in https://reviews.llvm.org/D99205. Loop unroll-and-jam might also help, and is also not enabled by default.
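For illustration, here is a hedged sketch of the switch-in-a-loop FSM shape that jump threading targets (modelled loosely on CoreMark's `core_state`; the function and names below are hypothetical, not from the patch). After threading, the back-edge branch from each `case` can jump directly to the likely successor state's code instead of re-dispatching through the `switch` on every iteration:

```c
#include <assert.h>

/* Hypothetical two-state scanner: counts runs of digits in a string.
 * The state variable 'st' is set in one iteration and immediately
 * switched on in the next, which is exactly the pattern jump
 * threading can short-circuit. */
static int count_numbers(const char *s) {
    enum { OUT, IN } st = OUT;  /* OUT: between tokens, IN: inside a number */
    int count = 0;
    for (; *s; ++s) {
        switch (st) {
        case OUT:
            if (*s >= '0' && *s <= '9') { st = IN; ++count; }
            break;
        case IN:
            if (!(*s >= '0' && *s <= '9')) st = OUT;
            break;
        }
    }
    return count;
}
```

With the state known at the end of each case, a threaded version effectively becomes a set of direct gotos between state blocks, removing the repeated dispatch branch.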
Yeah, it looks like that case should be OK from the A[i*N+j] use of an inbounds gep.
The widening sounds interesting, but on the architecture I work on at the moment, i64 induction variables would be much slower than i32s, costing more than would be gained by flattening the loop unless it started to vectorize a lot better. So it's not something that the pass can do on its own, but if it happens naturally for the architecture it should just work, provided you can get past all the phase-ordering issues.
https://github.com/eembc/coremark/blob/master/core_matrix.c#L242
could these CoreMark loops be improved, then?
And instead of overflow check, maybe just use wider induction variable (if possible)?
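To sketch what the wider-induction-variable idea might look like in source form (this is an illustrative hand-written version, not output of the pass; `copy_widened` is a made-up name): if the flattened trip count is computed in 64 bits, `n * n` cannot overflow for any 32-bit `n`, so no runtime overflow check is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Flattened copy with a 64-bit induction variable. The cast before
 * the multiply is what makes (int64_t)n * n overflow-free for any
 * non-negative 32-bit n. */
void copy_widened(int n, int *A, const int *B) {
    int64_t total = (int64_t)n * n;  /* cannot overflow in 64 bits */
    for (int64_t k = 0; k < total; ++k)
        A[k] = B[k];
}
```

As noted above, on targets where 64-bit induction variables are expensive this trades the overflow check for slower index arithmetic, so it is not a universal win.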
https://reviews.llvm.org/D42365. Written by Oliver S. I'm not sure it will handle case 1 without some help.
I never pushed it very hard, as it very rarely comes up in the benchmarks I ran. I think GCC used to have a similar pass that they removed (I may be wrong; I can't find anything about it online now). And Polly has/had a flattening schedule used for testing. Perhaps it doesn't need to be its own pass, and it's something we could teach loop-simplify-cfg to do? Not sure.
In your case, we would have to prove that n * n doesn't overflow. Alternatively, you can version the loop with a runtime check, but that doesn't sound very worthwhile to me.
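For concreteness, a hand-written sketch of what that versioning would look like (illustrative only; `copy_versioned` is a made-up name, and a real pass would emit this at the IR level): a runtime check guards the flattened fast path, with the original nest as the fallback.

```c
#include <assert.h>
#include <limits.h>

/* Versioned copy: take the flattened loop only when the runtime
 * check proves n * n fits in an int; otherwise run the original
 * nest with 64-bit index arithmetic. */
void copy_versioned(int n, int *A, const int *B) {
    if (n > 0 && (long long)n * n <= INT_MAX) {
        /* Fast path: flattened single loop, no overflow possible. */
        for (int k = 0, end = n * n; k < end; ++k)
            A[k] = B[k];
    } else {
        /* Fallback: original nest, widened index to stay defined. */
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                A[(long long)i * n + j] = B[(long long)i * n + j];
    }
}
```

The cost is code-size growth and an extra compare-and-branch on entry, which is part of why versioning rarely looks worthwhile for a loop this small.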
Extended Description
https://godbolt.org/z/B3VM2A
```c
void v0(int n, int *A, int *B) {
  int k = 0;
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
      A[k] = B[k];
      k++;
    }
}
```
is equivalent to
```c
void v1(int n, int *A, int *B) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
      int k = i * n + j;
      A[k] = B[k];
    }
}
```
(Also, which one of these ^ is better? clang does not optimize them into the same IR)
Shouldn't that be transformed into
```c
void v2(int n, int *A, int *B) {
  for (int k = 0; k < n * n; k++)
    A[k] = B[k];
}
```
?