llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

performance regression in clang-19 when using computed goto #106846

Open mikulas-patocka opened 2 months ago

mikulas-patocka commented 2 months ago

Hi

I noticed that the Ajla programming language ( https://www.ajla-lang.cz/ ) runs slower when compiled with clang-19 than with clang-18 or gcc-14. I looked at the generated assembly, and it turns out that the bytecode interpreter (the file "ipret.c") is not compiled as efficiently as it could be. In particular, clang-19 merges all the "goto next_label" statements in a function into a single indirect `jmp` instruction. That reduces code size, but it also makes branch prediction inefficient, because the CPU cannot learn that a single instruction jumps to multiple targets.

I created this example that shows the issue: http://www.jikos.cz/~mikulas/testcases/clang/computed-goto.c (use: `clang-19 -O2 computed-goto.c && time ./a.out`)

The results (in seconds) are here: http://www.jikos.cz/~mikulas/testcases/clang/computed-goto.txt

We can see that the worst slowdown happens on Sandy Bridge. Zen 4 shows no slowdown, so it seems to have a smart indirect branch predictor that can predict multiple jump targets for a single instruction.

nikic commented 2 months ago

This is caused by https://github.com/llvm/llvm-project/pull/78582, we get separate indirect gotos with -mllvm -tail-dup-pred-size=30. cc @DianQK Possibly we should not apply the limitation to INDIRECTBR, which is more profitable to tail-duplicate than other instructions.

mikulas-patocka commented 2 months ago

`-mllvm -tail-dup-pred-size=30` helps the testcase, but not Ajla. If you compile Ajla with `clang-19 -O2 -mllvm -tail-dup-pred-size=30` and run `objdump -d ipret.o | grep 'jmp \*'`, it shows that there is just one indirect jump in the whole file.

nikic commented 2 months ago

What about `-mllvm -tail-dup-pred-size=1000`? The value needs to be roughly as large as the number of interpreter instructions / indirect jumps.

mikulas-patocka commented 2 months ago

Yes - there are 2284 instructions in the Ajla interpreter. So I set `tail-dup-pred-size=3000`, and I get the same output as with clang-18: 2466 indirect jumps in ipret.o.

pinskia commented 2 months ago

Note that GCC has a specific RTL pass which duplicates the computed gotos (as GCC merges all computed gotos into one BB for performance reasons). It was added in GCC 4.0 and then improved for GCC 7.1.0 (to also unduplicate some related instructions). This pass happens late in the pipeline, after register allocation. This is just explaining GCC's solution to the problem and nothing more.

DianQK commented 2 months ago

> This is caused by #78582, we get separate indirect gotos with -mllvm -tail-dup-pred-size=30. cc @DianQK Possibly we should not apply the limitation to INDIRECTBR, which is more profitable to tail-duplicate than other instructions.

That makes sense. Thanks! I will check it later.

DianQK commented 2 months ago

The switch instruction also becomes an INDIRECTBR machine instruction, and I think we should still apply the limitation to PHIs. If I remember correctly, both https://github.com/llvm/llvm-project/issues/78578 and https://github.com/llvm/llvm-project/pull/79993#issuecomment-1936822679 contain a massive PHI. Duplicating a BB that contains a massive PHI causes the CFG to become exceptionally complex, leading to both compile-time and runtime performance issues.

DianQK commented 1 month ago

@mikulas-patocka Can you provide an Ajla example? I may need it to investigate this.

BTW, whether we use INDIRECTBR directly or arrive at it from a SWITCH, we introduce a lot of PHIs.

mikulas-patocka commented 1 month ago

The Ajla compiler is written in Ajla; you can look at the files in newlib/compiler/. I usually use self-compilation as a benchmark.

Run './configure && make' with CFLAGS='-O2 -DDEBUG_ENV'. The DEBUG_ENV macro enables environment variables that are useful for debugging.

Set the environment variable 'export CG=none' - this will disable the machine code generator and use only the interpreter.

Run the script ./scripts/update.sh - this will compile the Ajla compiler itself and re-generate the file builtin.pcd.

The results (on i7-2640M, Sandy Bridge, 2 cores, 4 threads):
- gcc-14: 22.8s
- clang-18: 22.2s
- clang-19: 26.8s

The results (on 7840U, Zen 4, 8 cores, 16 threads):
- gcc-14: 4.1s
- clang-18: 4.1s
- clang-19: 4.4s

If you want some simple benchmark, copy this piece of code to a file loop.ajla

```
fn main
[
        var sum := 0;
        for i := 0 to ston(args[0]) do
                sum += 1;
        write(h[1], ntos(sum) + nl);
]
```

and run it with `CG=none ./ajla --nosave loop.ajla 1000000000`.

The results (on i7-2640M):
- gcc-14: 17.3s
- clang-18: 15.3s
- clang-19: 39.0s

On 7840U there is too much jitter in this test.

DianQK commented 6 days ago

By comparing the results from gcc, I can observe the following:

Next, I'll look into some gcc code to see if I can find anything useful. Perhaps gcc is simply differentiating between goto and switch at the source level? :) Although I think they should be interchangeable in this scenario.

Detailed commands

```
$ time -f '%M' gcc -O3 -c oom_manual2.c -S
129608
$ cat oom_manual2.s | grep "jmp.*%" | wc -l
26
$ wc -l oom_manual2.s
18494 oom_manual2.s
$ time -f '%M' clang -O3 -c oom_manual2.c -mllvm -tail-dup-pred-size=3000 -mllvm -tail-dup-succ-size=3000 -S
836248
$ cat oom_manual2.s | grep "jmp.*%" | wc -l
3081
$ wc -l oom_manual2.s
50527 oom_manual2.s
$ time -f '%M' gcc -O3 -S -c computed-goto.c
47556
$ cat computed-goto.s | grep "jmp.*%" | wc -l
32
$ wc -l computed-goto.s
1184 computed-goto.s
$ time -f '%M' clang -O3 -mllvm -tail-dup-pred-size=3000 -mllvm -tail-dup-succ-size=3000 -S -c computed-goto.c
119084
$ cat computed-goto.s | grep "jmp.*%" | wc -l
51
$ wc -l computed-goto.s
1717 computed-goto.s
```