llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.42k stars 11.74k forks source link

PowerPC - miscompile yields infinite looping when using -code-model=large #32894

Closed llvmbot closed 2 years ago

llvmbot commented 7 years ago
Bugzilla Link 33547
Version 4.0
OS Windows NT
Attachments reproducer reduced from larger program
Reporter LLVM Bugzilla Contributor
CC @hfinkel,@jsonn,@nemanjai

Extended Description

This bug appears in the release_40 branch on PowerPC.

To reproduce:

% opt -O2 bug.bc -o bug.opt.bc % llc bug.opt.bc -mcpu=native -code-model=large -o bug.s

In the prologue for z0042(), there should be the following sequence

BB#0: # %L.entry

    addis 3, 2, .LC0@toc@ha
    bl .L0$pb

.L0$pb: ld 4, .LC0@toc@l(3) mflr 3 lwz 4, 0(4) cmpwi 4, 1 bltlr 0

which is a call from the prologue to the next instruction. This call sends the program into an infinite loop.

The reproducer has reduced from a much larger code to isolate this specific code gen pattern.

The SVN commit that results in this failure was 287059. Reverting that changeset eliminates the infinite loop and corrects the test application.

nemanjai commented 6 years ago

Fix in https://reviews.llvm.org/D43677.

nemanjai commented 6 years ago

Ok, this time I can reproduce the failure. The problem is that shrink-wrapping doesn't see MovePCtoLR8 as a def of a CSR. It then freely inserts the prologue in a successor to entry. As a result, we end up clobbering the LR before we've saved it in the prologue, which is obviously bad.

I'm testing a fix and will update this PR tomorrow with some results.

llvmbot commented 6 years ago

reproducer for release_60 OK, I was able to whittle the application down.

This example should reproduce the bl -> bl instruction sequence. The resulting executable crashes with a Segmentation fault on my system.

llvmbot commented 6 years ago

It does look like I need a new reproducer, since the characteristics of the original have been tickled somehow over time.

llvmbot commented 6 years ago

Just updated to r325786 on release_60 this morning, rebuilt, and get the same bug. The build is on a Power8 machine wiht gcc 5.4.0 and the test is run on a Power8 machine. There are no mods to the LLVM sources at all in this build.

% git remote -v origin http://llvm.org/git/llvm.git (fetch) origin http://llvm.org/git/llvm.git (push) % git checkout release_60 % git pull origin release_60

% cmake ReleaseSixSourceDir -DLLVM_TARGETS_TO_BUILD=PowerPC -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_TERMINFO=OFF -DCMAKE_EXE_LINKER_FLAGS="-static-libgcc -static-libstdc++" % make % make install

Note again that this bug requires the large code model. If I remove that, the test application will pass/run to correct completion.

Here are the updated LLVM tool commands.

% opt -O2 ./test.ll -S -o ./test.llvm % llc ./test.llvm -mcpu=pwr8 -O2 -code-model=large -o ./test.s % /usr/bin/as ./test.s -mpower8 -many -o test.o

PS: The number in the previous post is the git version hash that corresponds to r287059 (see, e.g. Hal's comment, above).

PPS: Community editions of the PGI compilers are available for download.

nemanjai commented 6 years ago

Unfortunately, I'm still unable to reproduce this. We should try to look into what the differences are between your environment and mine.

nemanjai commented 6 years ago

A similar code gen issue happens in release_60. In the test application, we have a call sequence as follows.

0x0000000010000ccc <+436>: bl 0x10000dc0 => 0x0000000010000cd0 <+440>: nop

The function called (0x10000dc0) is:

0x0000000010000db8 <+0>: ld r2,-8(r12) 0x0000000010000dbc <+4>: add r2,r2,r12 => 0x0000000010000dc0 <+8>: bl 0x10000dc4 0x0000000010000dc4 <+12>: mflr r6

So, we have a bl instruction jumping to another bl instruction, which will immediately overwrite the link register from the first call. After that, the application fails.

Oh wow. There's something really wrong if the first instruction in the local entry point is an unconditional branch and link instruction. The loop makes sense to me now - we set the link register to the beginning of the prologue (0x0000000010000dc4), save it and presumably restore it so the blr brings control back to 0x0000000010000dc4.

I'll try to reproduce this.

As before, reverting 883332301 fixes the bug. What does this number mean? Is this the same revision (relative jump table encodings)?

llvmbot commented 6 years ago

A similar code gen issue happens in release_60. In the test application, we have a call sequence as follows.

0x0000000010000ccc <+436>: bl 0x10000dc0 => 0x0000000010000cd0 <+440>: nop

The function called (0x10000dc0) is:

0x0000000010000db8 <+0>: ld r2,-8(r12) 0x0000000010000dbc <+4>: add r2,r2,r12 => 0x0000000010000dc0 <+8>: bl 0x10000dc4 0x0000000010000dc4 <+12>: mflr r6

So, we have a bl instruction jumping to another bl instruction, which will immediately overwrite the link register from the first call. After that, the application fails.

As before, reverting 883332301 fixes the bug.

llvmbot commented 6 years ago

Second question:

The code sequence

    bl .L0$pb

.L0$pb: ld 4, .LC0@toc@l(3)

loads the link register with the address of the ld instruction.

The code sequence

    bltlr

branches to the link register if the condition is less than. The link register that is 4 instructions previous. This forms a conditional loop.

If we look at the body of the loop, there are no instructions that can change the condition, so if it is less than, it will always be less than in future iterations.

First question:

I am not sure if this bug was fixed after release 4.0. It is possible.

nemanjai commented 6 years ago

Does this still reproduce for you on ToT? I can't seem to reproduce it. In fact, there is no jump table in the produced assembly for the switch.

Furthermore, I'm not sure I really understand what the claimed problem is/was. Was it that the contents of the link register are incorrect so the bltlr branches to the wrong place? Or was it that the branch wasn't taken when it should have been and then we loop forever in the loop within the test case? Or something entirely different?

llvmbot commented 7 years ago

% llc --version LLVM (http://llvm.org/): LLVM version 4.0.1 DEBUG build with assertions. Default target: powerpc64le-unknown-linux-gnu Host CPU: pwr8

Registered Targets: ppc32 - PowerPC 32 ppc64 - PowerPC 64 ppc64le - PowerPC 64 LE

llvmbot commented 7 years ago

By the way, the test provided is executable. When assembled and linked, the a.out file will run forever.

llvmbot commented 7 years ago

It's right at the top of the function. It is after the basic block #​0 comment. So, replace "in the prologue" with "immediately after the prologue".

.Lfunc_toc0: # @​z0042 .quad .TOC.-.Lfunc_gep0 z0042: .Lfunc_begin0: .Lfunc_gep0: ld 2, .Lfunc_toc0-.Lfunc_gep0(12) add 2, 2, 12 .Lfunc_lep0: .localentry z0042, .Lfunc_lep0-.Lfunc_gep0

BB#0: # %L.entry

    addis 3, 2, .LC0@toc@ha
    bl .L0$pb

.L0$pb:

jsonn commented 7 years ago

Well, is it looping in the prologue or really in the epilogue? I don't understand how it could do the former, but there might be a clobber issue for the latter.

hfinkel commented 7 years ago

FYI: r287059 - Always use relative jump table encodings on PowerPC64.

llvmbot commented 7 years ago

assigned to @nemanjai

nemanjai commented 2 years ago

This has long since been fixed. I'm not sure why I never closed the issue/bug.