Author: Marco Elver (melver)
It looks like gcc assumes that extern int var can't be in an .so, while clang does not. E.g. even for -pie one can do an extern declaration for some var provided by libc.
Is it possible to fix this with some attribute on the variable and/or additional compiler flags? Attributes are not what we want, but they may give hints as to why clang is doing this.
FWIW, I worked around this in tcmalloc by providing both an extern declaration and a weak definition in a header:
https://github.com/google/tcmalloc/blob/2d35fee5fdf26833ae628f5c9f86f6274fd5fee5/tcmalloc/internal/percpu.h#L167-L181
But this does not scale.
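The pattern looks roughly like this (a sketch with an illustrative variable name, not the actual tcmalloc code):

// Header sketch: pair the extern declaration with a weak definition.
// Every TU then sees a definition that ends up in the executable, so
// -fPIE codegen can use a direct, PC-relative access; a strong
// definition elsewhere still wins over the weak one at link time.
extern "C" int external_var;
extern "C" __attribute__((weak)) int external_var = 0;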
It looks like gcc assumes that extern int var can't be in an .so [...]
This depends on the relocation model. By default, I was under the impression it would just use copy relocations, which means the main binary gets a copy of every variable it accesses, including those defined in an .so. The .so then uses the GOT to find the real location of the variable in the main binary.
https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected
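For context, a minimal setup (hypothetical file names) where that copy-relocation scheme kicks in:

// lib.cc, built into libfoo.so
int var = 42;

// main.cc, built as an executable that accesses var directly (non-GOT)
extern int var;
int main() { return var; }

// With a direct access from the executable, the static linker emits an
// R_X86_64_COPY relocation: var gets a copy in the executable's data
// segment, and libfoo.so's own references resolve through the GOT to
// that copy.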
@MaskRay
FWIW, it looks like this was an optimization done in GCC by Google back in the day, but lost in LLVM: https://gcc.gnu.org/git/?p=gcc.git&a=commit;h=77ad54d911dd7cb88caf697ac213929f6132fdcf
Ok, it looks like -fno-direct-access-external-data is the default:
% clang++ -O2 -fPIE -fdirect-access-external-data -S -o - test.cc
...
_Z3foov: # @_Z3foov
.cfi_startproc
# %bb.0:
incl var(%rip)
retq
I see https://reviews.llvm.org/D92633, but I can't find why the current default is better.
Clang's GOT-generating code sequence (for an external, default-visibility, non-local variable) works as intended. clang -fdirect-access-external-data -fPIE generates the GCC-like code sequence.
https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected discusses why GCC 5's x86-64 behavior change is actually a regression.
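Concretely, the two code sequences for an increment of var (a sketch):

# -fno-direct-access-external-data (default): GOT-indirect access
movq var@GOTPCREL(%rip), %rax
incl (%rax)
# -fdirect-access-external-data: GCC-like direct access
incl var(%rip)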
It's a conscious decision that we don't want to follow the GCC x86-64 behavior.
It's a conscious decision that we don't want to follow the GCC x86-64 behavior.
Yes, I found https://gcc.gnu.org/legacy-ml/gcc/2019-05/msg00215.html, too.
But was there any discussion of how to fix the common case, i.e. where we have a mostly static PIE binary that just references data that is linked into the final executable?
I see LLD ends up relaxing these GOT accesses in "lld/ELF/Arch/X86_64.cpp:relaxGot()" into a lea, but then we end up with lots of patterns like this:
leaq 0x123456(%rip), %rax
movq (%rax), %rax
Is there any post-link optimizer that fixes that?
I'm almost tempted to teach LLD to turn trivial patterns like that into just movq 0x123456(%rip), %rax, but that's really awful if that's the only solution.
It's a conscious decision that we don't want to follow the GCC x86-64 behavior.
Yes, I found gcc.gnu.org/legacy-ml/gcc/2019-05/msg00215.html, too.
Yes, the conclusion of that discussion is that for the default case we want to avoid copy relocations, at the cost of a minor penalty for mostly statically linked executables such as server applications.
But was there any discussion of how to fix the common case? [...]
I'm almost tempted to teach LLD to turn trivial patterns like that into just movq 0x123456(%rip), %rax, but that's really awful if that's the only solution.
We cannot delete bytes from sections for x86, which restricts the available optimization schemes. LLD's choice follows x86-64-abi. movq 0x123456(%rip), %rax is 3 bytes shorter than leaq+movq. Are you suggesting that movq+nop may be faster than leaq+movq? If so, x86-64-abi might be the place to discuss alternative code sequences.
It's possible that we might end up with a new relocation type...
[...]
We cannot delete bytes from sections for x86, which restricts the available optimization schemes. LLD's choice follows x86-64-abi. movq 0x123456(%rip), %rax is 3 bytes shorter than leaq+movq. Are you suggesting that movq+nop may be faster than leaq+movq? If so, x86-64-abi might be the place to discuss alternative code sequences.
Yes, movq+nop was what I had in mind.
It's possible that we might end up with a new relocation type...
In trivial cases where it is obvious that the address register isn't a dependency for later instructions, i.e. the lea ..., %reg; mov (%reg), %reg case, it appears to be provably correct to just do mov+nop.
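To illustrate with the byte counts from above (the displacement and register are just placeholders):

movq var@GOTPCREL(%rip), %rax   # 7 bytes: load var's address from the GOT
movq (%rax), %rax               # 3 bytes: load var's value
# after LLD's current relaxGot() rewrite (sizes unchanged):
leaq var(%rip), %rax            # 7 bytes: compute var's address directly
movq (%rax), %rax               # 3 bytes: load var's value
# proposed rewrite of the whole pair, padded to the same size:
movq var(%rip), %rax            # 7 bytes: load var's value directly
# + 3 bytes of nop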
Consider this input TU:
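Judging by the symbol names in the outputs below, presumably something like (sketch):

// test.cc
extern int var;
void foo() { ++var; }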
With Clang (trunk) and -fPIE:
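Presumably along these lines (abridged sketch, directives omitted):

_Z3foov:
        movq var@GOTPCREL(%rip), %rax
        incl (%rax)
        retq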
Notice the movq var@GOTPCREL(%rip), %rax.
With GCC (13.2):
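Presumably (abridged sketch):

_Z3foov:
        addl $1, var(%rip)
        ret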
GCC correctly produces the more optimal addl $1, var(%rip).
It appears that Clang's -fPIE simply falls back to the -fPIC code sequence, which is also roughly what GCC produces with -fPIC. And by default Clang seems to produce similarly bad, GOT-indirect code.