Performance regression since llvm 9 using pointers for array traversal

outpaddling commented 2 years ago

I wrote a simple benchmark using selection sort to compare performance of various compilers and interpreters:

https://github.com/outpaddling/Lang-speed

I noticed that clang 9 and later are much slower than clang 8 when using pointers to traverse the array. It is also much slower than the same program with subscripts. Clang 8 shows what I would expect, with pointers marginally faster than subscripts.

I see this issue on both FreeBSD (clang9 and later) and MacOS (clang13), though on MacOS the pointer version takes only about 1.5x as long as subscripts.

From the run times shown below, it looks like clang stopped doing some optimizations beyond -O starting at version 9. I don't see anything in the v9 release notes about this type of optimization. https://releases.llvm.org/9.0.0/docs/ReleaseNotes.html

To reproduce:

git clone https://github.com/outpaddling/Lang-speed.git cd Lang-speed (Edit clang-check to use available clang compilers) ./clang-check

clang version 8.0.1 (tags/RELEASE_801/final)
Target: x86_64-portbld-freebsd13.0
Thread model: posix
InstalledDir: /usr/local/llvm80/bin
-O
Subscripts:         2.10 real         2.09 user         0.00 sys
Pointers:           2.02 real         2.02 user         0.00 sys
-O2
Subscripts:         2.11 real         2.11 user         0.00 sys
Pointers:           2.02 real         2.02 user         0.00 sys
-Ofast
Subscripts:         2.36 real         2.35 user         0.00 sys
Pointers:           2.23 real         2.22 user         0.00 sys
clang version 9.0.1 
Target: x86_64-portbld-freebsd13.0
Thread model: posix
InstalledDir: /usr/local/llvm90/bin
-O
Subscripts:         2.24 real         2.23 user         0.00 sys
Pointers:           4.67 real         4.67 user         0.00 sys
-O2
Subscripts:         2.08 real         2.07 user         0.00 sys
Pointers:           4.64 real         4.64 user         0.00 sys
-Ofast
Subscripts:         2.03 real         2.03 user         0.00 sys
Pointers:           3.15 real         3.15 user         0.00 sys
clang version 13.0.0
Target: x86_64-portbld-freebsd13.0
Thread model: posix
InstalledDir: /usr/local/llvm13/bin
-O
Subscripts:         4.69 real         4.69 user         0.00 sys
Pointers:           4.65 real         4.64 user         0.00 sys
-O2
Subscripts:         2.10 real         2.10 user         0.00 sys
Pointers:           4.81 real         4.80 user         0.00 sys
-Ofast
Subscripts:         2.13 real         2.13 user         0.00 sys
Pointers:           4.74 real         4.73 user         0.00 sys

dwblaikie commented 2 years ago

@aeubanks - /maybe/ this could be some loss due to opaque pointers? (not sure if we have a group or how to tag a group here... )

outpaddling commented 2 years ago

I don't think so, it's a simple array that's part of a fully defined structure. The code is here in case someone wants a quick peek without cloning: https://github.com/outpaddling/Lang-speed/blob/master/selsort-pointers.c https://github.com/outpaddling/Lang-speed/blob/master/selsort-subscripts.c

outpaddling commented 2 months ago

Performance with subscripts has also regressed in more recent releases, to the point where GCC is now about 30% faster for selection sort with integers:

Sorting with Unix sort command...
User time = 0.15 seconds (.00250 minutes) (.00004 hours).
RSS = 15452 KB

Sorting int array with clang18 and subscripts...
User time = 4.60 seconds (.07666 minutes) (.00127 hours).
RSS = 2852 KB

Sorting long array with clang18 and subscripts...
User time = 4.60 seconds (.07666 minutes) (.00127 hours).
RSS = 3240 KB

Sorting float array with clang18 and subscripts...
User time = 4.61 seconds (.07683 minutes) (.00128 hours).
RSS = 2848 KB

Sorting double array with clang18 and subscripts...
User time = 4.62 seconds (.07700 minutes) (.00128 hours).
RSS = 3240 KB

Sorting int array with clang18 and pointers...
User time = 4.66 seconds (.07766 minutes) (.00129 hours).
RSS = 2848 KB

Sorting long array with clang18 and pointers...
User time = 4.67 seconds (.07783 minutes) (.00129 hours).
RSS = 3240 KB

Sorting float array with clang18 and pointers...
User time = 6.12 seconds (.10200 minutes) (.00170 hours).
RSS = 2852 KB

Sorting double array with clang18 and pointers...
User time = 6.12 seconds (.10200 minutes) (.00170 hours).
RSS = 3240 KB

Sorting int array with gcc14 and subscripts...
User time = 3.10 seconds (.05166 minutes) (.00086 hours).
RSS = 2840 KB

Sorting long array with gcc14 and subscripts...
User time = 3.13 seconds (.05216 minutes) (.00086 hours).
RSS = 3228 KB

Sorting float array with gcc14 and subscripts...
User time = 3.12 seconds (.05200 minutes) (.00086 hours).
RSS = 2844 KB

Sorting double array with gcc14 and subscripts...
User time = 3.13 seconds (.05216 minutes) (.00086 hours).
RSS = 3236 KB

Sorting int array with gcc14 and pointers...
User time = 3.08 seconds (.05133 minutes) (.00085 hours).
RSS = 2840 KB

Sorting long array with gcc14 and pointers...
User time = 3.08 seconds (.05133 minutes) (.00085 hours).
RSS = 3228 KB

Sorting float array with gcc14 and pointers...
User time = 5.80 seconds (.09666 minutes) (.00161 hours).
RSS = 2844 KB

Sorting double array with gcc14 and pointers...
User time = 5.80 seconds (.09666 minutes) (.00161 hours).
RSS = 3232 KB

nikic commented 2 months ago

Godbolt: https://c.godbolt.org/z/Y4Y3ozs37

It looks like the difference is in runtime unrolling. Clang 9 unrolled both variants, clang 13 unrolled the pointer variant, current clang unrolls neither.

These are far enough back that I don't know whether the changes in unrolling behavior were intentional or incidental.

nikic commented 2 months ago

The loops don't get expanded because the trip count fails the high cost expansion check.

The trip counts are:

subscript: {(-1 + %0),+,-1}<nw><%for.body>
pointer: (1 + (({(-5 + (-1 * (ptrtoint ptr %0 to i64))),+,-4}<nw><%for.body> + (((4 * %1) + (ptrtoint ptr %0 to i64)) umax {(8 + (ptrtoint ptr %0 to i64)),+,4}<nw><%for.body>)) /u 4))<nuw><nsw>

The cost is probably over-estimated here, because (at least in final codegen) we wouldn't actually have a dedicated separate IV and just treat it as an offset of a canonical IV.

nikic commented 2 months ago

Something like this allows the subscript variant to unroll:

diff --git a/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp b/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
index c7d758aa575e..1d249e84782a 100644
--- a/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
+++ b/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
@@ -2044,8 +2044,24 @@ bool SCEVExpander::isHighCostExpansionHelper(
     return Cost > Budget;
   }
   case scAddRecExpr: {
-    assert(cast<SCEVAddRecExpr>(S)->getNumOperands() >= 2 &&
-           "Polynomial should be at least linear");
+    auto *AR = cast<SCEVAddRecExpr>(S);
+    if (CanonicalMode && AR->isAffine()) {
+      PHINode *CanonicalIV = AR->getLoop()->getCanonicalInductionVariable();
+      if (CanonicalIV && CanonicalIV->getType() == AR->getType()) {
+        // The addrec will be emitted as %canonal.iv * %step + %start.
+        if (!AR->getOperand(1)->isOne()) {
+          Cost += TTI.getArithmeticInstrCost(Instruction::Mul, AR->getType(),
+                                             CostKind);
+          Worklist.emplace_back(Instruction::Mul, 1, AR->getOperand(1));
+        }
+        Cost += TTI.getArithmeticInstrCost(Instruction::Add, AR->getType(),
+                                           CostKind);
+        Worklist.emplace_back(Instruction::Add, 1, AR->getStart());
+        return Cost > Budget;
+      }
+    }
+
+    assert(AR->getNumOperands() >= 2 && "Polynomial should be at least linear");
     Cost += costAndCollectOperands<SCEVAddRecExpr>(
         WorkItem, TTI, CostKind, Worklist);
     return Cost > Budget;

I was going to submit that, but then I looked at how AddRec's are normally costed and ... it just doesn't make any sense. For {(-1 + %0),+,-1} what we're going to generate are two adds and a phi, while the cost model seems to think it will result in an add and two multiplies. I think if that gets fixed, we probably don't need to check for canonical IVs. I'll try to take a look tomorrow.

outpaddling commented 2 months ago

The loops don't get expanded because the trip count fails the high cost expansion check.

The trip counts are:
subscript: {(-1 + %0),+,-1}<nw><%for.body>
pointer: (1 + (({(-5 + (-1 * (ptrtoint ptr %0 to i64))),+,-4}<nw><%for.body> + (((4 * %1) + (ptrtoint ptr %0 to i64)) umax {(8 + (ptrtoint ptr %0 to i64)),+,4}<nw><%for.body>)) /u 4))<nuw><nsw>
The cost is probably over-estimated here, because (at least in final codegen) we wouldn't actually have a dedicated separate IV and just treat it as an offset of a canonical IV.

If that's the case, shouldn't adding -funroll-loops serve as a workaround? I tried and it doesn't change the results.

Compiling with clang18 -O2 -funroll-loops -march=native, subscripts, int...
...
Sorting with Unix sort command...
User time = 0.14 seconds (.00233 minutes) (.00003 hours).
RSS = 15528 KB

Sorting int array with clang18 and subscripts...
User time = 4.68 seconds (.07800 minutes) (.00130 hours).
RSS = 2844 KB

outpaddling commented 2 months ago

Godbolt: https://c.godbolt.org/z/Y4Y3ozs37

It looks like the difference is in runtime unrolling. Clang 9 unrolled both variants, clang 13 unrolled the pointer variant, current clang unrolls neither.

These are far enough back that I don't know whether the changes in unrolling behavior were intentional or incidental.

This info doesn't seem to match the results in my original comment above. They show that clang9 regressed for pointers, but not subscripts. Clang13 regressed for both with -O, but only regressed for pointers if -O2 or -Ofast was used. So subscripts were faster with clang13 and aggressive optimizations.

nikic commented 2 months ago

If that's the case, shouldn't adding -funroll-loops serve as a workaround? I tried and it doesn't change the results.

-funroll-loops just controls whether unrolling is enabled, it does not influence the cost model. You'd need something like -mllvm -scev-cheap-expansion-budget=6 to get the subscript variant to unroll and -mllvm -scev-cheap-expansion-budget=20 (or maybe more) for the pointer variant.

outpaddling commented 2 months ago

If that's the case, shouldn't adding -funroll-loops serve as a workaround? I tried and it doesn't change the results.

-funroll-loops just controls whether unrolling is enabled, it does not influence the cost model. You'd need something like -mllvm -scev-cheap-expansion-budget=6 to get the subscript variant to unroll and -mllvm -scev-cheap-expansion-budget=20 (or maybe more) for the pointer variant.

Confirmed. -mllvm -scev-cheap-expansion-budget=20 didn't do it for pointers, but -mllvm -scev-cheap-expansion-budget=30 did.

Results from a slightly different machine, so cannot be compared to other comments:

Sorting with Unix sort command...
User time = 0.19 seconds (.00316 minutes) (.00005 hours).
RSS = 15516 KB

Sorting int array with clang18 and subscripts...
User time = 2.06 seconds (.03433 minutes) (.00057 hours).
RSS = 2852 KB

Sorting long array with clang18 and subscripts...
User time = 2.08 seconds (.03466 minutes) (.00057 hours).
RSS = 3236 KB

Without -mllvm -scev-cheap-expansion-budget=30 :

Sorting int array with clang18 and subscripts...
User time = 4.82 seconds (.08033 minutes) (.00133 hours).
RSS = 2848 KB

Sorting long array with clang18 and subscripts...
User time = 4.86 seconds (.08100 minutes) (.00135 hours).
RSS = 3240 KB

outpaddling commented 1 month ago

Thanks for your work on this...

I updated the language comparison scripts to output in a much more condensed and readable format. Some oddities that stand out:

GCC is much faster on the Intel core processors I tests, while clang is faster or about the same on the AMDs.
Both Clang and GCC show poor performance with pointers to floating points on everything except the Opteron. Why would subscripts be faster with floating point, but not with integers? Programs were compiled with only -O2 on all systems, so the binaries should be the same.

Full results are at https://github.com/outpaddling/Lang-speed/tree/master/Results

Intel(R) Core(TM) i5-3470T CPU @ 2.90GHz

Language             Type       Access        Seconds  Minutes  Hours    RSS
unix-sort            int        -                0.14      0.0    0.0  15516k
clang18              int        subscripts       4.61      0.1    0.0   2848k
clang18              long       subscripts       4.63      0.1    0.0   3240k
clang18              float      subscripts       4.64      0.1    0.0   2848k
clang18              double     subscripts       4.63      0.1    0.0   3232k
clang18              int        pointers         4.68      0.1    0.0   2848k
clang18              long       pointers         4.68      0.1    0.0   3240k
clang18              float      pointers         6.11      0.1    0.0   2852k
clang18              double     pointers         6.10      0.1    0.0   3240k
gcc14                int        subscripts       3.10      0.1    0.0   2840k
gcc14                long       subscripts       3.12      0.1    0.0   3232k
gcc14                float      subscripts       3.13      0.1    0.0   2844k
gcc14                double     subscripts       3.16      0.1    0.0   3232k
gcc14                int        pointers         3.09      0.1    0.0   2844k
gcc14                long       pointers         3.09      0.1    0.0   3232k
gcc14                float      pointers         5.82      0.1    0.0   2844k
gcc14                double     pointers         5.82      0.1    0.0   3232k

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

clang18              int        subscripts       4.09      0.1    0.0   2272k
clang18              long       subscripts       4.10      0.1    0.0   2664k
clang18              float      subscripts       4.12      0.1    0.0   2272k
clang18              double     subscripts       4.11      0.1    0.0   2664k
clang18              int        pointers         4.16      0.1    0.0   2272k
clang18              long       pointers         4.17      0.1    0.0   2660k
clang18              float      pointers         5.44      0.1    0.0   2276k
clang18              double     pointers         5.43      0.1    0.0   2664k
gcc14                int        subscripts       2.75      0.0    0.0   2268k
gcc14                long       subscripts       2.79      0.0    0.0   2656k
gcc14                float      subscripts       2.78      0.0    0.0   2264k
gcc14                double     subscripts       2.80      0.0    0.0   2656k
gcc14                int        pointers         2.75      0.0    0.0   2268k
gcc14                long       pointers         2.75      0.0    0.0   2660k
gcc14                float      pointers         5.18      0.1    0.0   2268k
gcc14                double     pointers         5.18      0.1    0.0   2652k

AMD Opteron(tm) Processor 4386                  (3100.13-MHz K8-class CPU)

clang18              int        subscripts       8.98      0.1    0.0   2572k
clang18              long       subscripts       8.98      0.1    0.0   2956k
clang18              float      subscripts       9.02      0.2    0.0   2568k
clang18              double     subscripts       7.61      0.1    0.0   2960k
clang18              int        pointers         4.66      0.1    0.0   2572k
clang18              long       pointers         6.59      0.1    0.0   2956k
clang18              float      pointers         4.05      0.1    0.0   2568k
clang18              double     pointers         4.97      0.1    0.0   2960k
gcc14                int        subscripts       7.93      0.1    0.0   2556k
gcc14                long       subscripts       8.08      0.1    0.0   2952k
gcc14                float      subscripts       8.12      0.1    0.0   2564k
gcc14                double     subscripts       8.12      0.1    0.0   2952k
gcc14                int        pointers         6.45      0.1    0.0   2564k
gcc14                long       pointers         6.45      0.1    0.0   2952k
gcc14                float      pointers         3.88      0.1    0.0   2560k
gcc14                double     pointers         3.96      0.1    0.0   2952k

AMD Phenom(tm) II X4 955 Processor (3214.34-MHz K8-class CPU)

clang18              int        subscripts       6.30      0.1    0.0   2852k
clang18              long       subscripts       6.39      0.1    0.0   3240k
clang18              float      subscripts       6.35      0.1    0.0   2852k
clang18              double     subscripts       6.47      0.1    0.0   3244k
clang18              int        pointers         4.74      0.1    0.0   2856k
clang18              long       pointers         4.88      0.1    0.0   3248k
clang18              float      pointers        11.52      0.2    0.0   2852k
clang18              double     pointers        11.55      0.2    0.0   3248k
gcc14                int        subscripts       7.86      0.1    0.0   2840k
gcc14                long       subscripts       7.96      0.1    0.0   3240k
gcc14                float      subscripts       7.89      0.1    0.0   2844k
gcc14                double     subscripts       7.99      0.1    0.0   3236k
gcc14                int        pointers         6.69      0.1    0.0   2844k
gcc14                long       pointers         6.45      0.1    0.0   3232k
gcc14                float      pointers        11.50      0.2    0.0   2848k
gcc14                double     pointers        11.51      0.2    0.0   3236k

llvm / llvm-project

Performance regression since llvm 9 using pointers for array traversal #53205