Sorry, I haven't looked into this case, but I want to point out that the problem may be in the DAG builder, not the scheduler itself.
One thing that mutates the DAG is load clustering. You can try -misched-cluster=false to see if that's the problem.
I agree with Quentin's guess that the scheduler should be able to handle it by reducing pressure, but there might be some unexpected dependency that prevents the scheduler from correcting it.
-debug-only=misched should show you the scheduling DAG, the ready queue at each step, and why the scheduler chose to schedule an instruction.
Note that the scheduler should detect high register pressure before it begins scheduling and select heuristics that attempt to reduce pressure. It is also possible that the heuristics are failing to handle this case. Unfortunately, I don't have time to look at it for the next couple of weeks.
I completely agree that it's not a register allocation problem.
By "earlier in the pipeline" you mean the front-end? It seems a bit strange to me that it separates computation chains like this. Obviously, neither gcc nor icc do it. Should I file a bug against clang then?
With respect to misched, it seems to be an issue of size. If I halve the chained additions, it is actually doing a much better job.
Andy, any thoughts on misched?
Again, the frontend distills the chained summation out of the actual computations, thereby significantly increasing the register pressure, and misched can't undo the damage.
I had a quick look into EvalStripped2.ll and it does seem that we hoist the something0...somethingN computations above the chained additions. To be able to recover from this situation in the register allocator, we would need to rematerialize the whole dependency chain. This seems heroic to me and of little use in the general case. Therefore, I believe the proper fix would be either in misched or earlier in the pipeline.
There may be good reasons why misched cannot recover from this situation (maybe some kind of aliasing we need to honor?)
Vlad, could you check why misched cannot undo the damage? Andy would be your source of information if you need help.
That way, we could determine the proper place to fix this.
I'll have a closer look, at some point, at the register allocator to be sure I am not missing something.
Thanks Vlad!
I'll try to have a look this week.
I've added another file, which has about half of the pathological function. It's still misbehaving.
Note that reducing it further (to half of that function) does not reproduce the problem; llvm works really well with either half. It seems a certain threshold needs to be met before the massive spills appear.
In any case, stats for EvalStripped2.ll:

                                 Total  st stores  loads  st loads  moves  alu  mult  popcnt
    baseline                       389         19     69        19     59  194    29       0
    misched disabled               396         21     69        21     62  194    29       0
    join-liveintervals disabled    408         18     69        18     80  194    29       0
    both disabled                  413         20     69        20     81  194    29       0

The per-opcode counts are identical across the four configurations except for mov: add 75, and 62, dec 1, imul 29, movabs 22, movzbl 4, movzwl 2, pop 6, push 6, shr 49, sub 1, with mov at 85 (baseline), 92 (misched disabled), 104 (join-liveintervals disabled) and 109 (both disabled).
Based on my visual examination of EvalStripped2.ll:
Again, the frontend distills the chained summation out of the actual computations, thereby significantly increasing the register pressure, and misched can't undo the damage. Of course, this is my non-llvm-er opinion.
Alright, so I've added a file with just the function. The function is still the same (it is not simplified). While preparing it, I noticed something interesting.
Basically, at a higher level the function uses a lot of accumulator addition:
    value += compute something 0
    value += compute something 1
    ...
    value += compute something N
return value
The frontend (which produces the .ll file) seems to completely disregard the ordering though. It calculates something0 to somethingN, and puts them into some IR registers. Then at the end it does all the chain additions.
    %628 = add i32 %281, %276, !dbg !1207
    %629 = add i32 %628, %287, !dbg !1212
    %630 = add i32 %629, %370, !dbg !1178
    %631 = add i32 %630, %375, !dbg !1183
    %632 = add i32 %631, %381, !dbg !1135
    %633 = add i32 %632, %459, !dbg !1140
    %634 = add i32 %633, %464, !dbg !1079
    %635 = add i32 %634, %470, !dbg !1084
    %636 = add i32 %635, %506, !dbg !1121
    %637 = add i32 %636, %511, !dbg !1164
    %638 = add i32 %637, %517, !dbg !1193
    %639 = add i32 %638, %611, !dbg !1226
    %640 = add i32 %639, %607, !dbg !1231
    %641 = add i32 %640, %615, !dbg !1250
    %642 = add i32 %641, %619, !dbg !1236
    %643 = add i32 %642, %623, !dbg !1241
    %644 = add i32 %643, %627, !dbg !1245
    ret i32 %644, !dbg !970
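To make the shape difference concrete, here is a minimal C++ sketch of the two orderings (hypothetical names, shortened to two terms):

    // Hypothetical helpers standing in for "compute something k".
    int compute_something_0();
    int compute_something_1();

    // Source order: one accumulator; each temporary dies right after its use.
    int interleaved() {
        int value = 0;
        value += compute_something_0();
        value += compute_something_1();
        // ... and so on up to compute_something_N()
        return value;
    }

    // The shape the frontend produces: every temporary is computed first and
    // stays live until the final chain of adds.
    int hoisted() {
        int t0 = compute_something_0();
        int t1 = compute_something_1();
        // ... int tN = compute_something_N();
        return t0 + t1; // + ... + tN
    }

In the first shape at most one temporary is live alongside the accumulator; in the second, all N temporaries are live simultaneously at the head of the add chain.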
By distilling the chain addition out of the actual computation, the frontend (TTBOMK) significantly increases the register pressure. I'm assuming that the misched pass would try to reschedule the instructions to reduce the register pressure (which in this case seems to have been artificially inflated by the frontend). Interestingly, turning off misched helps (though not as much as turning off join-liveintervals). Here are some updated stats (note that I have revised my methodology slightly, e.g. I'm now counting an add %reg, (memory location) as a load, an add and a store).
Baseline (neither opt turned off), misched turned off, livejoin turned off, and both turned off:

                    Total  st stores  loads  st loads  moves  alu  mult  popcnt
    baseline          734         76    122        76     85  346    29       0
    misched off       693         45    122        45    106  346    29       0
    livejoin off      702         37    122        37    131  346    29       0
    both off          714         45    122        45    127  346    29       0

The per-opcode counts are identical across the four configurations except for mov and one movabs: add 134, and 95, dec 1, imul 29, lea 8, movabs 30 (31 with misched off), movzbl 8, movzwl 6, or 12, pop 6, push 6, sar 1, shl 6, shr 73, sub 2, with mov at 233 (baseline), 192 (misched off), 202 (livejoin off) and 214 (both off).
The interaction between the two (livejoin and misched) is non-linear.
Ok, I'll file another report for a + 2*b not being LEA'd. It's a different issue indeed.
I'll make an attempt to simplify the function as well.
- there's an issue with join-liveintervals causing about 38 spills
Yes, this is a known issue with our splitting mechanism. llvm/llvm-project#19199 (http://llvm.org/bugs/show_bug.cgi?id=18825) is another example where we hit that.
- finally, a + 2 * b where a and b are unsigned 32-bit integers is not folded into a single lea instruction.
Is it worth filing another PR for this specific issue?
Regarding the register allocator problem, could you reduce the test case even more? I.e., even if the reduced test case does not show such a huge difference in spill code between clang and gcc/icc, working on something smaller that still shows a reasonable increase in spills would speed up the process :).
Thanks.
Again, only CEvaluator::EvalMobs is of consequence. If need be, I can streamline the file to include that function and its inlined callees.
Andrew,
Per your request, I ran the misched thing. The output regarding register pressure on EvalMobs looks as follows:
** MI Scheduling **
_ZNK10CEvaluator8EvalMobsERK4Pos2jj:BB#0
From: DBG_VALUE %RDI, %noreg, !"this"; line no:0
To: RETQ %EAX
Note that you should be able to run llc on the .ll (gziped) file that I added, and I'm guessing that you'll obtain the same thing.
Ok, so I've identified another issue:
As I implied earlier, the function computes a lot of base-3 patterns. Most of these base-3 pattern calculations take the form of:
    base2ToBase3Table[(empty >> (8 * (ROW))) & 0xff] +
    base2ToBase3Table[(mover >> (8 * (ROW))) & 0xff] * 2
Things work well up to the final summation. Essentially, gcc and icc generate an lea instruction to do the addition, e.g.:
    439fad: mov 0x768880(,%rax,4),%eax
    439fb4: mov 0x768880(,%rsi,4),%esi
    439fbb: lea (%rax,%rsi,2),%r11d
However, llvm produces an extra add:
    434755: mov 0x6ad540(%rcx),%ecx
    43475b: add %ecx,%ecx
    43475d: add 0x6ad540(%rdx),%ecx
True, the number of instructions is the same. However(!), the final add is in fact two micro-ops (see Agner's instruction tables, http://agner.org/optimize/instruction_tables.pdf, page 185; this is not specific to Haswell).
This explains the discrepancy in ALU counts almost entirely, as I'm counting an instruction such as add 0x6ad540(%rdx),%ecx as a load and an add.
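For reference, a minimal stand-alone reproducer of the a + 2*b pattern (hypothetical function name, not taken from ntest) would be:

    #include <cstdint>

    // gcc/icc compile the return expression to a single lea (e.g.
    // lea (%rax,%rsi,2),%eax); clang 3.5 instead emits add %esi,%esi
    // followed by a second add, per the disassembly above.
    uint32_t sum_scaled(uint32_t a, uint32_t b) {
        return a + 2 * b;
    }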
To summarize:
- there's an issue with join-liveintervals causing about 38 spills
- there's another issue causing the remaining 37 spills
- finally, a + 2 * b where a and b are unsigned 32-bit integers is not folded into a single lea instruction
Oh yeah. Hidden flags like -debug-only are not available in the Release build. You (or someone else) need to run with a Release+Asserts or Debug build. If you ping me next week I can probably do it for you.
Thanks for the comments!
Passing down (via -mllvm) -join-liveintervals=false does help:
    Total  st stores  loads  st loads  moves  alu  mult  popcnt
      694         37    122        37    147  322    28       1
Sure, there are more moves, but those should be cheaper than stores and loads (recent processors may even handle moves entirely in the front end of the pipeline).
Andrew, I passed the following param: -mllvm -debug-only=misched
    $ /usr/bin/clang++-3.5 -D_M_AMD64 -m64 -msse4 -O3 -std=c++0x -g3 -Wno-unused-result -Wno-deprecated -c /home/vlad/g2/ntest/src/Evaluator.cpp -o eval.o -mllvm -debug-only=misched
    clang (LLVM option parsing): Unknown command line argument '-debug-only=misched'. Try: 'clang (LLVM option parsing) -help'
    clang (LLVM option parsing): Did you mean '-debug-pass=misched'?
Regarding the ALU operations: I think I know what's going on with respect to the ALU vs. move counts.
A pattern in the code is the following: extract a row from a bitboard (a bitboard is a 64-bit binary representation of the board; usually two such bitboards are used to represent a position). The code is pretty straightforward:
base2ToBase3Table[(empty >> (8 * (ROW))) & 0xff] + \
base2ToBase3Table[(mover >> (8 * (ROW))) & 0xff] * 2
TConfig Row0 = BB_EXTRACT_ROW_PATTERN(0);
value += pR1[Row0];
TConfig Row1 = BB_EXTRACT_ROW_PATTERN(1);
value += pR2[Row1];
(and so on 6 more times, and then 8 more times for a flipped board to extract the columns as well)
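For context, the macro presumably looks something like this (my reconstruction from the two lines above, not the verbatim ntest source):

    // Hypothetical reconstruction. base2ToBase3Table maps an 8-bit row mask
    // to the integer whose base-3 digits are that mask's bits, so empty
    // squares contribute base-3 digit 1 and mover squares, scaled by 2,
    // contribute digit 2.
    #define BB_EXTRACT_ROW_PATTERN(ROW) \
        (base2ToBase3Table[(empty >> (8 * (ROW))) & 0xff] + \
         base2ToBase3Table[(mover >> (8 * (ROW))) & 0xff] * 2)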
So what's the issue? Well, base2ToBase3Table is an array of 32-bit integers. The address is actually ((empty >> (8 * (ROW))) & 0xff) * 4.
LLVM is folding the * 4 operation with the AND and the shift. So you end up with (empty >> (8 * ROW - 2)) & (0xff << 2), i.e. & 0x3fc.
GCC and ICC don't do that; they instead use the more complex addressing modes. The number of instructions is the same though:
LLVM:
    1: empty >> (8 * ROW - 2)
    2: and #1 with 0x3fc
    3: load #2

GCC and ICC:
    1: empty >> (8 * ROW)
    2: and with 0xff, which is implemented as a movzbl
    3: load #2 with a multiplier of 4
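In C terms, the two lowerings compute the same value as follows (a sketch; the function names are mine, and the table is only declared for illustration):

    #include <cstdint>

    extern const uint32_t base2ToBase3Table[256];

    // llvm's lowering (for ROW >= 1; the ROW == 0 case shifts left instead):
    // the *4 element size is folded into the shift and the mask, yielding a
    // byte offset added to the table base with plain addressing.
    uint32_t llvm_lowering(uint64_t empty, unsigned ROW) {
        uint64_t byte_off = (empty >> (8 * ROW - 2)) & 0x3fc;
        return *(const uint32_t *)((const char *)base2ToBase3Table + byte_off);
    }

    // gcc/icc lowering: mask with 0xff (a movzbl) and let the scaled
    // addressing mode (,%reg,4) supply the *4.
    uint32_t gcc_lowering(uint64_t empty, unsigned ROW) {
        return base2ToBase3Table[(empty >> (8 * ROW)) & 0xff];
    }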
To improve the comparison I migrated movzbl and movzwl into the ALU section.
                                Total  st stores  loads  st loads  moves  alu  mult  popcnt
    gcc                           547          9    122         9     81  297    28       1
    clang, regular flags          713         75    122        75     76  336    28       1
    clang, livejoin disabled      694         37    122        37    133  336    28       1
Anyway:
a) I'm still trying to figure out where the rest of the difference in alu instructions is coming from
b) is it possible to also disable the folding of the shift/multiply (by 4 in the example above) into the AND?
Are we generating moves that look like alu instructions?
Is the schedule the same?
-debug-only=misched should report the register pressure after coalescing somewhere in its output.
I haven’t looked at the test case, but based on the statistics I would say we, at least partly, pay the price of the aggressiveness of the register coalescer (see the numbers of moves).
You can try with -join-liveintervals=false to see what happens.
Also, do you know why we have so many alu operations? I do not think those are added by the register allocator, so it is hard to blame the register allocator if the unallocated input code is not already reasonably close.
Extended Description
The code in question is from an Othello minimax engine. https://github.com/vladpetric/ntest/branches (jmobs2 branch)
The static evaluation function of ntest is a huge basic block, of around 550 instructions. It's a complex basic block, with a lot of live values.
Unfortunately, clang produces no fewer than 75 spills! I count a spill as a store/load pair to the stack, which includes pushes, pops, moves and other instructions with a memory operand that uses %rsp (callee-saved register traffic is included here as well). At the same time gcc 4.8.2 generates 9, and icc only 4. Given that this is a single linear basic block, all instructions are executed exactly the same number of times.
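To make that rule concrete, a minimal sketch of such a predicate (hypothetical helper, for illustration only):

    #include <string>

    // An instruction line counts as stack traffic if it pushes, pops, or
    // addresses memory through %rsp, per the definition above.
    bool is_stack_access(const std::string &line) {
        return line.find("(%rsp") != std::string::npos ||
               line.find("push") != std::string::npos ||
               line.find("pop ") != std::string::npos;  // "pop " so popcnt doesn't match
    }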
Instruction statistics for the function CEvaluator::EvalMobs :
                         Total  st stores  loads  st loads  moves  alu  mult  popcnt
    gcc 4.8.2              547          9    122         9    117  261    28       1
    icc                    541          4    122         4    122  260    28       1
    clang (3.5 trunk)      704         76    122        75     81  321    28       1
A few notes:
- there are no branches or non-stack stores; "st loads" means stack loads, and "st stores" means stack stores.
- non-stack loads are loads from evaluation tables. The mults come primarily from the "magic multiplier" bit gather trick (see the sketch after these notes). popcnt is called explicitly, a constant number of times.
- an x86_64 ALU instruction that has a memory operand is counted twice, as both ALU and load. Such operations are generally broken into two micro-ops (one load and one alu).
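As an aside, the "magic multiplier" bit-gather trick works roughly as follows; the constants below are illustrative (a main-diagonal gather), not the engine's actual ones:

    #include <cstdint>

    // Masked bits sit at positions 0, 9, 18, ..., 63; multiplying by a column
    // of 0x01 bytes slides a copy of each bit into the top byte, which a
    // single shift then extracts.
    uint8_t gather_diagonal(uint64_t bb) {
        const uint64_t diag  = 0x8040201008040201ULL;  // main diagonal of an 8x8 board
        const uint64_t magic = 0x0101010101010101ULL;  // the "magic" multiplier
        return (uint8_t)(((bb & diag) * magic) >> 56);
    }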
The code can be compiled by fetching the source code and building the straightforward cmake project, but I've also attached the .ll file, which includes the pathological function (CEvaluator::EvalMobs).