Open outpaddling opened 2 years ago
@aeubanks - /maybe/ this could be some loss due to opaque pointers? (not sure if we have a group or how to tag a group here... )
I don't think so, it's a simple array that's part of a fully defined structure. The code is here in case someone wants a quick peek without cloning: https://github.com/outpaddling/Lang-speed/blob/master/selsort-pointers.c https://github.com/outpaddling/Lang-speed/blob/master/selsort-subscripts.c
Performance with subscripts has also regressed in more recent releases, to the point where GCC is now about 30% faster for selection sort with integers:
Sorting with Unix sort command...
User time = 0.15 seconds (.00250 minutes) (.00004 hours).
RSS = 15452 KB
Sorting int array with clang18 and subscripts...
User time = 4.60 seconds (.07666 minutes) (.00127 hours).
RSS = 2852 KB
Sorting long array with clang18 and subscripts...
User time = 4.60 seconds (.07666 minutes) (.00127 hours).
RSS = 3240 KB
Sorting float array with clang18 and subscripts...
User time = 4.61 seconds (.07683 minutes) (.00128 hours).
RSS = 2848 KB
Sorting double array with clang18 and subscripts...
User time = 4.62 seconds (.07700 minutes) (.00128 hours).
RSS = 3240 KB
Sorting int array with clang18 and pointers...
User time = 4.66 seconds (.07766 minutes) (.00129 hours).
RSS = 2848 KB
Sorting long array with clang18 and pointers...
User time = 4.67 seconds (.07783 minutes) (.00129 hours).
RSS = 3240 KB
Sorting float array with clang18 and pointers...
User time = 6.12 seconds (.10200 minutes) (.00170 hours).
RSS = 2852 KB
Sorting double array with clang18 and pointers...
User time = 6.12 seconds (.10200 minutes) (.00170 hours).
RSS = 3240 KB
Sorting int array with gcc14 and subscripts...
User time = 3.10 seconds (.05166 minutes) (.00086 hours).
RSS = 2840 KB
Sorting long array with gcc14 and subscripts...
User time = 3.13 seconds (.05216 minutes) (.00086 hours).
RSS = 3228 KB
Sorting float array with gcc14 and subscripts...
User time = 3.12 seconds (.05200 minutes) (.00086 hours).
RSS = 2844 KB
Sorting double array with gcc14 and subscripts...
User time = 3.13 seconds (.05216 minutes) (.00086 hours).
RSS = 3236 KB
Sorting int array with gcc14 and pointers...
User time = 3.08 seconds (.05133 minutes) (.00085 hours).
RSS = 2840 KB
Sorting long array with gcc14 and pointers...
User time = 3.08 seconds (.05133 minutes) (.00085 hours).
RSS = 3228 KB
Sorting float array with gcc14 and pointers...
User time = 5.80 seconds (.09666 minutes) (.00161 hours).
RSS = 2844 KB
Sorting double array with gcc14 and pointers...
User time = 5.80 seconds (.09666 minutes) (.00161 hours).
RSS = 3232 KB
Godbolt: https://c.godbolt.org/z/Y4Y3ozs37
It looks like the difference is in runtime unrolling. Clang 9 unrolled both variants, clang 13 unrolled the pointer variant, current clang unrolls neither.
These are far enough back that I don't know whether the changes in unrolling behavior were intentional or incidental.
The loops don't get expanded because the trip count fails the high cost expansion check.
The trip counts are:
subscript: {(-1 + %0),+,-1}<nw><%for.body>
pointer: (1 + (({(-5 + (-1 * (ptrtoint ptr %0 to i64))),+,-4}<nw><%for.body> + (((4 * %1) + (ptrtoint ptr %0 to i64)) umax {(8 + (ptrtoint ptr %0 to i64)),+,4}<nw><%for.body>)) /u 4))<nuw><nsw>
The cost is probably over-estimated here, because (at least in final codegen) we wouldn't actually have a dedicated separate IV and just treat it as an offset of a canonical IV.
Something like this allows the subscript variant to unroll:
diff --git a/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp b/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
index c7d758aa575e..1d249e84782a 100644
--- a/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
+++ b/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
@@ -2044,8 +2044,24 @@ bool SCEVExpander::isHighCostExpansionHelper(
return Cost > Budget;
}
case scAddRecExpr: {
- assert(cast<SCEVAddRecExpr>(S)->getNumOperands() >= 2 &&
- "Polynomial should be at least linear");
+ auto *AR = cast<SCEVAddRecExpr>(S);
+ if (CanonicalMode && AR->isAffine()) {
+ PHINode *CanonicalIV = AR->getLoop()->getCanonicalInductionVariable();
+ if (CanonicalIV && CanonicalIV->getType() == AR->getType()) {
+ // The addrec will be emitted as %canonal.iv * %step + %start.
+ if (!AR->getOperand(1)->isOne()) {
+ Cost += TTI.getArithmeticInstrCost(Instruction::Mul, AR->getType(),
+ CostKind);
+ Worklist.emplace_back(Instruction::Mul, 1, AR->getOperand(1));
+ }
+ Cost += TTI.getArithmeticInstrCost(Instruction::Add, AR->getType(),
+ CostKind);
+ Worklist.emplace_back(Instruction::Add, 1, AR->getStart());
+ return Cost > Budget;
+ }
+ }
+
+ assert(AR->getNumOperands() >= 2 && "Polynomial should be at least linear");
Cost += costAndCollectOperands<SCEVAddRecExpr>(
WorkItem, TTI, CostKind, Worklist);
return Cost > Budget;
I was going to submit that, but then I looked at how AddRec's are normally costed and ... it just doesn't make any sense. For {(-1 + %0),+,-1}
what we're going to generate are two adds and a phi, while the cost model seems to think it will result in an add and two multiplies. I think if that gets fixed, we probably don't need to check for canonical IVs. I'll try to take a look tomorrow.
The loops don't get expanded because the trip count fails the high cost expansion check.
The trip counts are:
subscript: {(-1 + %0),+,-1}<nw><%for.body> pointer: (1 + (({(-5 + (-1 * (ptrtoint ptr %0 to i64))),+,-4}<nw><%for.body> + (((4 * %1) + (ptrtoint ptr %0 to i64)) umax {(8 + (ptrtoint ptr %0 to i64)),+,4}<nw><%for.body>)) /u 4))<nuw><nsw>
The cost is probably over-estimated here, because (at least in final codegen) we wouldn't actually have a dedicated separate IV and just treat it as an offset of a canonical IV.
If that's the case, shouldn't adding -funroll-loops serve as a workaround? I tried and it doesn't change the results.
Compiling with clang18 -O2 -funroll-loops -march=native, subscripts, int...
...
Sorting with Unix sort command...
User time = 0.14 seconds (.00233 minutes) (.00003 hours).
RSS = 15528 KB
Sorting int array with clang18 and subscripts...
User time = 4.68 seconds (.07800 minutes) (.00130 hours).
RSS = 2844 KB
Godbolt: https://c.godbolt.org/z/Y4Y3ozs37
It looks like the difference is in runtime unrolling. Clang 9 unrolled both variants, clang 13 unrolled the pointer variant, current clang unrolls neither.
These are far enough back that I don't know whether the changes in unrolling behavior were intentional or incidental.
This info doesn't seem to match the results in my original comment above. They show that clang9 regressed for pointers, but not subscripts. Clang13 regressed for both with -O, but only regressed for pointers if -O2 or -Ofast was used. So subscripts were faster with clang13 and aggressive optimizations.
If that's the case, shouldn't adding -funroll-loops serve as a workaround? I tried and it doesn't change the results.
-funroll-loops
just controls whether unrolling is enabled, it does not influence the cost model. You'd need something like -mllvm -scev-cheap-expansion-budget=6
to get the subscript variant to unroll and -mllvm -scev-cheap-expansion-budget=20
(or maybe more) for the pointer variant.
If that's the case, shouldn't adding -funroll-loops serve as a workaround? I tried and it doesn't change the results.
-funroll-loops
just controls whether unrolling is enabled, it does not influence the cost model. You'd need something like-mllvm -scev-cheap-expansion-budget=6
to get the subscript variant to unroll and-mllvm -scev-cheap-expansion-budget=20
(or maybe more) for the pointer variant.
Confirmed. -mllvm -scev-cheap-expansion-budget=20 didn't do it for pointers, but -mllvm -scev-cheap-expansion-budget=30 did.
Results from a slightly different machine, so cannot be compared to other comments:
Sorting with Unix sort command...
User time = 0.19 seconds (.00316 minutes) (.00005 hours).
RSS = 15516 KB
Sorting int array with clang18 and subscripts...
User time = 2.06 seconds (.03433 minutes) (.00057 hours).
RSS = 2852 KB
Sorting long array with clang18 and subscripts...
User time = 2.08 seconds (.03466 minutes) (.00057 hours).
RSS = 3236 KB
Without -mllvm -scev-cheap-expansion-budget=30 :
Sorting int array with clang18 and subscripts...
User time = 4.82 seconds (.08033 minutes) (.00133 hours).
RSS = 2848 KB
Sorting long array with clang18 and subscripts...
User time = 4.86 seconds (.08100 minutes) (.00135 hours).
RSS = 3240 KB
Thanks for your work on this...
I updated the language comparison scripts to output in a much more condensed and readable format. Some oddities that stand out:
Full results are at https://github.com/outpaddling/Lang-speed/tree/master/Results
Intel(R) Core(TM) i5-3470T CPU @ 2.90GHz
Language Type Access Seconds Minutes Hours RSS
unix-sort int - 0.14 0.0 0.0 15516k
clang18 int subscripts 4.61 0.1 0.0 2848k
clang18 long subscripts 4.63 0.1 0.0 3240k
clang18 float subscripts 4.64 0.1 0.0 2848k
clang18 double subscripts 4.63 0.1 0.0 3232k
clang18 int pointers 4.68 0.1 0.0 2848k
clang18 long pointers 4.68 0.1 0.0 3240k
clang18 float pointers 6.11 0.1 0.0 2852k
clang18 double pointers 6.10 0.1 0.0 3240k
gcc14 int subscripts 3.10 0.1 0.0 2840k
gcc14 long subscripts 3.12 0.1 0.0 3232k
gcc14 float subscripts 3.13 0.1 0.0 2844k
gcc14 double subscripts 3.16 0.1 0.0 3232k
gcc14 int pointers 3.09 0.1 0.0 2844k
gcc14 long pointers 3.09 0.1 0.0 3232k
gcc14 float pointers 5.82 0.1 0.0 2844k
gcc14 double pointers 5.82 0.1 0.0 3232k
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
clang18 int subscripts 4.09 0.1 0.0 2272k
clang18 long subscripts 4.10 0.1 0.0 2664k
clang18 float subscripts 4.12 0.1 0.0 2272k
clang18 double subscripts 4.11 0.1 0.0 2664k
clang18 int pointers 4.16 0.1 0.0 2272k
clang18 long pointers 4.17 0.1 0.0 2660k
clang18 float pointers 5.44 0.1 0.0 2276k
clang18 double pointers 5.43 0.1 0.0 2664k
gcc14 int subscripts 2.75 0.0 0.0 2268k
gcc14 long subscripts 2.79 0.0 0.0 2656k
gcc14 float subscripts 2.78 0.0 0.0 2264k
gcc14 double subscripts 2.80 0.0 0.0 2656k
gcc14 int pointers 2.75 0.0 0.0 2268k
gcc14 long pointers 2.75 0.0 0.0 2660k
gcc14 float pointers 5.18 0.1 0.0 2268k
gcc14 double pointers 5.18 0.1 0.0 2652k
AMD Opteron(tm) Processor 4386 (3100.13-MHz K8-class CPU)
clang18 int subscripts 8.98 0.1 0.0 2572k
clang18 long subscripts 8.98 0.1 0.0 2956k
clang18 float subscripts 9.02 0.2 0.0 2568k
clang18 double subscripts 7.61 0.1 0.0 2960k
clang18 int pointers 4.66 0.1 0.0 2572k
clang18 long pointers 6.59 0.1 0.0 2956k
clang18 float pointers 4.05 0.1 0.0 2568k
clang18 double pointers 4.97 0.1 0.0 2960k
gcc14 int subscripts 7.93 0.1 0.0 2556k
gcc14 long subscripts 8.08 0.1 0.0 2952k
gcc14 float subscripts 8.12 0.1 0.0 2564k
gcc14 double subscripts 8.12 0.1 0.0 2952k
gcc14 int pointers 6.45 0.1 0.0 2564k
gcc14 long pointers 6.45 0.1 0.0 2952k
gcc14 float pointers 3.88 0.1 0.0 2560k
gcc14 double pointers 3.96 0.1 0.0 2952k
AMD Phenom(tm) II X4 955 Processor (3214.34-MHz K8-class CPU)
clang18 int subscripts 6.30 0.1 0.0 2852k
clang18 long subscripts 6.39 0.1 0.0 3240k
clang18 float subscripts 6.35 0.1 0.0 2852k
clang18 double subscripts 6.47 0.1 0.0 3244k
clang18 int pointers 4.74 0.1 0.0 2856k
clang18 long pointers 4.88 0.1 0.0 3248k
clang18 float pointers 11.52 0.2 0.0 2852k
clang18 double pointers 11.55 0.2 0.0 3248k
gcc14 int subscripts 7.86 0.1 0.0 2840k
gcc14 long subscripts 7.96 0.1 0.0 3240k
gcc14 float subscripts 7.89 0.1 0.0 2844k
gcc14 double subscripts 7.99 0.1 0.0 3236k
gcc14 int pointers 6.69 0.1 0.0 2844k
gcc14 long pointers 6.45 0.1 0.0 3232k
gcc14 float pointers 11.50 0.2 0.0 2848k
gcc14 double pointers 11.51 0.2 0.0 3236k
I wrote a simple benchmark using selection sort to compare performance of various compilers and interpreters:
https://github.com/outpaddling/Lang-speed
I noticed that clang 9 and later are much slower than clang 8 when using pointers to traverse the array. It is also much slower than the same program with subscripts. Clang 8 shows what I would expect, with pointers marginally faster than subscripts.
I see this issue on both FreeBSD (clang9 and later) and MacOS (clang13), though on MacOS the pointer version takes only about 1.5x as long as subscripts.
From the run times shown below, it looks like clang stopped doing some optimizations beyond -O starting at version 9. I don't see anything in the v9 release notes about this type of optimization. https://releases.llvm.org/9.0.0/docs/ReleaseNotes.html
To reproduce:
git clone https://github.com/outpaddling/Lang-speed.git cd Lang-speed (Edit clang-check to use available clang compilers) ./clang-check