Open llvmbot opened 14 years ago
Wow! That is awesome information. I remember pulling my hair out trying to figure this out experimentally. Thanks!
This is great information! You might consider sending it to the llvmdev list as well, which has a lot broader exposure than this bugzilla does. I think many other folks would benefit from this.
I know this is super old, but I took a quick look at this issue and the test-case attached to pr3120 to see if anything jumped out at me. Mostly for educational purposes, and also to see if there are any opportunities.
Since this report is very old, it’s unclear on which architecture the performance swings were reported and, perhaps more importantly, whether we care about those architectures today, or not.
I chose to play around with it a little on today’s hardware to see if there are still any alignment issues. It turned out that with “0 mod 32” vs. “16 mod 32” byte alignment, the benchmark did show significant swings (~50%-70%) on both Ivy Bridge (IVB) and Haswell (HSW).
The reason for the swings wasn’t immediately obvious, but some deeper analysis pointed to the DSB (Decoded Stream Buffer, the post-decode uop cache). I wrote up a detailed presentation of what’s going on so that I could share it with the rest of my team for educational purposes (attached).
The quick summary: the DSB caches post-decode uops for frequently executed code so that the front-end decode stages and their overhead can be bypassed, which lets it feed 32B worth of instructions per clock instead of 16B. The DSB allows 3 ways (each of which can hold 6 uops) to be allocated to each 32B chunk of instructions (by IP address). Unconditional branches always end a way. If tightly packed code with lots of JMP instructions is aligned and laid out so that a single 32B chunk needs more than 3 ways, execution can keep flip-flopping between the DSB and the front end. This is inefficient and incurs additional penalties. You can find additional details in the presentation, and also in the public Intel Optimization Manual.
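To make the way-counting rule concrete, here is a minimal sketch of a toy model built only from the three rules in the summary above (3 ways per 32B chunk, 6 uops per way, an unconditional branch ends a way). The function name `ways_per_chunk`, the tuple layout, and the example instruction sizes are hypothetical; real hardware has further restrictions that this does not model.

```python
from collections import defaultdict

CHUNK = 32         # bytes of code mapped to one DSB set (from the summary above)
UOPS_PER_WAY = 6   # uops a single way can hold
MAX_WAYS = 3       # ways available per 32B chunk

def ways_per_chunk(start, insns):
    """Toy model. insns is a list of (length_bytes, uops, is_unconditional_jmp),
    with 1 <= uops <= UOPS_PER_WAY. Each instruction is attributed to the
    32B chunk containing its first byte. Returns {chunk_base: ways_needed}."""
    ways = defaultdict(int)
    addr = start
    cur_chunk = None
    room = 0  # uops still free in the currently open way
    for length, uops, is_jmp in insns:
        chunk = addr - addr % CHUNK
        # Open a new way on entering a new chunk or when the current way is full.
        if chunk != cur_chunk or uops > room:
            ways[chunk] += 1
            cur_chunk = chunk
            room = UOPS_PER_WAY
        room -= uops
        if is_jmp:
            room = 0  # an unconditional branch always ends the way
        addr += length
    return dict(ways)

def overflowing_chunks(start, insns):
    """Chunks that would need more ways than the DSB provides in this model."""
    return [c for c, w in ways_per_chunk(start, insns).items() if w > MAX_WAYS]

# Hypothetical code: four tiny blocks, each a 3-byte op followed by a 5-byte jmp.
blocks = [(3, 1, False), (5, 1, True)] * 4
print(ways_per_chunk(0, blocks))   # all four jmps land in one 32B chunk
print(ways_per_chunk(16, blocks))  # same code shifted by 16 bytes
```

In this model the same 32 bytes of code need 4 ways in one chunk when they start at offset 0, but only 2 ways in each of two chunks when they start at offset 16, which is the kind of layout sensitivity described above.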
It’s tricky to decide whether something can or should be done about this. One option is to pad code whenever we detect multiple jmp instructions in a potential 32B chunk of instructions (specifically, more than 3). This may cause unnecessary code bloat with no payoff, but the pattern could also be rare enough that the padding is insignificant while still boosting performance in those rare cases. I plan on playing around with this a little to see how many cases we can catch in SPEC, for example, and measure bloat vs. perf to see if it’s a viable solution.
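The padding idea can be sketched as follows. This is only an illustration of the heuristic as described (pad when a 32B chunk would contain more than 3 jmps), not an actual LLVM pass; `pad_for_jmps` and the instruction sizes are hypothetical.

```python
CHUNK = 32     # chunk size in bytes
MAX_JMPS = 3   # jmps tolerated per chunk before padding

def pad_for_jmps(start, insns):
    """insns is a list of (length_bytes, is_unconditional_jmp).
    Before a jmp that would be the 4th one starting in its 32B chunk,
    pad up to the next chunk boundary.
    Returns (total_padding_bytes, padded_code_size)."""
    addr = start
    cur_chunk = None
    jmps = 0
    total_pad = 0
    for length, is_jmp in insns:
        chunk = addr // CHUNK
        if chunk != cur_chunk:
            cur_chunk, jmps = chunk, 0       # new chunk: reset the jmp count
        if is_jmp and jmps == MAX_JMPS:
            pad = (chunk + 1) * CHUNK - addr  # bytes to the next 32B boundary
            addr += pad
            total_pad += pad
            cur_chunk, jmps = chunk + 1, 0
        if is_jmp:
            jmps += 1
        addr += length
    return total_pad, addr - start

# Hypothetical code: four blocks, each a 3-byte op followed by a 5-byte jmp.
blocks = [(3, False), (5, True)] * 4
print(pad_for_jmps(0, blocks))   # padding needed when all jmps share a chunk
print(pad_for_jmps(16, blocks))  # no padding when the code straddles two chunks
```

For this input the sketch adds 5 bytes of padding when the code starts at offset 0 and none when it starts at offset 16, so the bloat depends on layout as well as on jmp density. That is why measuring it on a real suite matters.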
The other option would be to do nothing, and make do with simply understanding what the problem is so that it can be identified in the future. Architectures change rapidly, and this could be something that goes away soon.
In either case, I’ll probably pursue the first option above and report back on what I find.
Regarding the other details reported in this issue, I realize that the slow vs. fast cases both had 0 mod 32 byte alignment. It’s hard to do the analysis on what the issue there was, without having the exact code and the exact (old) architecture on which it was run. If I had to guess, I would say that it was a case of unfortunate aliasing in the branch prediction buffer, causing differences in the prediction of one of the many branches, particularly the indirect branch, which is known to have prediction issues on some older architectures.
Feel free to contact me if you’d like additional info.
Thanks, Zia Ansari.
I think the first action on this should be to find out the cause of these huge variations. I looked through some info on x86-64 performance but didn't see anything that quite matched this behavior. If we know the cause, we can then take the next step and see if there is anything we can do in llvm to address it. It's hard to speculate on that without knowing what might be required.
If this were a 10-20% performance degradation, I wouldn't hesitate to gloss over it. Instead, though, the bad cases are roughly 2.5x slower (runtime 240 vs. 584 to 617). Ignoring that seems like a bad idea.
Bob, this bug doesn't seem "actionable". What should we do with it?
Extended Description
See pr3120 for background and testcase. The "switched interpreter" runtime degraded from 240 to 584 when I changed llvm to tail duplicate indirect branches. The change affected code that does not run for the "switched interpreter", and the changes are located after the switched interpreter code. The only effect of the change (aside from the "threaded interpreter" code) was that the linker adjusted the starting offsets of various functions. The "interpret_switch" function is aligned to a 16-byte boundary but that is apparently not good enough to get consistently good performance.
When the start of that function was at 0x100001830, the performance was very good (240). At 0x100001810, it was bad (584). At 0x100001820, it was even worse (617). The latter case occurred when I manually deleted some .align directives from the threaded interpreter assembly. When I edited the assembly to increase the alignment of interpret_switch to 32 bytes, that function was placed at 0x100001800 and performance improved (291).
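As a quick check on the alignments involved, the following sketch computes each reported start address modulo 16, 32, and 64. The addresses and runtimes are the ones quoted above; the helper name `residues` is just for illustration.

```python
# (start address of interpret_switch, reported runtime), as quoted above
RUNS = [
    (0x100001830, 240),
    (0x100001810, 584),
    (0x100001820, 617),
    (0x100001800, 291),
]

def residues(addr, moduli=(16, 32, 64)):
    """Return {modulus: addr % modulus} for each alignment of interest."""
    return {m: addr % m for m in moduli}

for addr, runtime in RUNS:
    r = residues(addr)
    print(f"{addr:#x}  mod16={r[16]:2d}  mod32={r[32]:2d}  mod64={r[64]:2d}  runtime={runtime}")
```

All four addresses are 16-byte aligned. The two fast runs are 16 mod 32 (0x...1830) and 0 mod 32 (0x...1800), and so are the two slow runs (0x...1810 and 0x...1820), so 32-byte alignment of the function start alone does not separate fast from slow.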
I'm not sure what is causing these performance variations. The branch predictor in some x86 processors fetches aligned blocks of 32 bytes, so it may be related to that. But if it were that simple, I don't know how to explain the huge differences in performance when the function was at 1800 vs. 1820 or 1810 vs. 1830, since each of those pairs has the same alignment mod 32.
To reproduce, run "make CC=llvm-gcc-4.2" with the testcase from pr3120 and then run the test with "intrp data/*".
I used the following patch to make the test run only the "switched" interpreter:
--- interpret.c.orig 2009-11-25 13:41:47.000000000 -0800
+++ interpret.c 2009-11-25 13:41:20.000000000 -0800
@@ -145,6 +145,7 @@ Interpreter interpreters[] = {
 { &interpret_switch, "Switched interpreter" },
#ifdef BLOCKS
The current llvm trunk has the "bad" behavior. If you manually edit the assembly for interpret.c and remove the .align directives from inside the interpret_threaded function, the offset of interpret_switch should change.