Author: Marco Elver (melver)
It looks like gcc assumes that extern int var can't be in an .so, while clang does not. E.g. even for -pie one can do an extern declaration for some var provided by libc.
Is it possible to fix this with some attribute on the variable and/or additional compiler flags? Attributes are not what we want, but they may give hints as to why clang is doing this.
FWIW, I worked around this in tcmalloc by providing both an extern declaration and a weak definition in a header:
https://github.com/google/tcmalloc/blob/2d35fee5fdf26833ae628f5c9f86f6274fd5fee5/tcmalloc/internal/percpu.h#L167-L181
But this does not scale.
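The pattern looks roughly like this (a sketch with an illustrative variable name, not the actual tcmalloc code):

// Header sketch: pair the extern declaration with a weak definition.
// Every TU then sees a definition that ends up in the executable, so
// -fPIE codegen can use a direct, PC-relative access; a strong
// definition elsewhere still wins over the weak one at link time.
extern "C" int external_var;
extern "C" __attribute__((weak)) int external_var = 0;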
It looks like gcc assumes that extern int var can't be in an .so [...]
This depends on the relocation model. By default, I was under the impression it would just use copy relocations, which means the main binary gets a copy of every variable it accesses, including those defined in an .so. The .so then uses the GOT to find the real location of the variable in the main binary.
https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected
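For context, a minimal setup (hypothetical file names) where that copy-relocation scheme kicks in:

// lib.cc, built into libfoo.so
int var = 42;

// main.cc, built as an executable that accesses var directly (non-GOT)
extern int var;
int main() { return var; }

// With a direct access from the executable, the static linker emits an
// R_X86_64_COPY relocation: var gets a copy in the executable's data
// segment, and libfoo.so's own references resolve through the GOT to
// that copy.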
@MaskRay
FWIW, it looks like this was an optimization done in GCC by Google back in the day, but lost in LLVM: https://gcc.gnu.org/git/?p=gcc.git&a=commit;h=77ad54d911dd7cb88caf697ac213929f6132fdcf
Ok, it looks like -fno-direct-access-external-data is the default:
% clang++ -O2 -fPIE -fdirect-access-external-data -S -o - test.cc
...
_Z3foov: # @_Z3foov
.cfi_startproc
# %bb.0:
incl var(%rip)
retq
I see https://reviews.llvm.org/D92633, but I can't find why the current default is better.
Clang's GOT-generating code sequence (for an external, default-visibility, non-local variable) works as intended. clang -fdirect-access-external-data -fPIE generates the GCC-like code sequence.
https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected discusses why GCC 5's x86-64 behavior change is actually a regression.
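Concretely, the two code sequences for an increment of var (a sketch):

# -fno-direct-access-external-data (default): GOT-indirect access
movq var@GOTPCREL(%rip), %rax
incl (%rax)
# -fdirect-access-external-data: GCC-like direct access
incl var(%rip)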
It's a conscious decision that we don't want to follow the GCC x86-64 behavior.
It's a conscious decision that we don't want to follow the GCC x86-64 behavior.
Yes, I found https://gcc.gnu.org/legacy-ml/gcc/2019-05/msg00215.html, too.
But was there any discussion of how to fix the common case, i.e. where we have a mostly static PIE binary that just references data that is linked into the final executable?
I see LLD ends up relaxing these GOT accesses in "lld/ELF/Arch/X86_64.cpp:relaxGot()" into a lea, but then we end up with lots of patterns like this:
leaq 0x123456(%rip), %rax
movq (%rax), %rax
Is there any post-link optimizer that fixes that?
I'm almost tempted to teach LLD to turn trivial patterns like that into just movq 0x123456(%rip), %rax, but that's really awful if that's the only solution.
It's a conscious decision that we don't want to follow the GCC x86-64 behavior.
Yes, I found gcc.gnu.org/legacy-ml/gcc/2019-05/msg00215.html, too.
Yes, the conclusion of that discussion is that for the default case we want to avoid copy relocations, at the cost of a minor penalty for mostly statically linked executables such as server applications.
But was there any discussion of how to fix the common case? [...]
I'm almost tempted to teach LLD to turn trivial patterns like that into just movq 0x123456(%rip), %rax, but that's really awful if that's the only solution.
We cannot delete bytes from sections for x86, which restricts the available optimization schemes. LLD's choice follows x86-64-abi. movq 0x123456(%rip), %rax is 3 bytes shorter than leaq+movq. Are you suggesting that movq+nop may be faster than leaq+movq? If so, x86-64-abi might be the place to discuss alternative code sequences.
It's possible that we might end up with a new relocation type...
[...]
We cannot delete bytes from sections for x86, which restricts the available optimization schemes. LLD's choice follows x86-64-abi. movq 0x123456(%rip), %rax is 3 bytes shorter than leaq+movq. Are you suggesting that movq+nop may be faster than leaq+movq? If so, x86-64-abi might be the place to discuss alternative code sequences.
Yes, movq+nop was what I had in mind.
It's possible that we might end up with a new relocation type...
In trivial cases where it is obvious that the address register isn't a dependency for later instructions, i.e. the lea ..., %reg; mov (%reg), %reg case, it appears to be provably correct to just do mov+nop.
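To illustrate with the byte counts from above (the displacement and register are just placeholders):

movq var@GOTPCREL(%rip), %rax   # 7 bytes: load var's address from the GOT
movq (%rax), %rax               # 3 bytes: load var's value
# after LLD's current relaxGot() rewrite (sizes unchanged):
leaq var(%rip), %rax            # 7 bytes: compute var's address directly
movq (%rax), %rax               # 3 bytes: load var's value
# proposed rewrite of the whole pair, padded to the same size:
movq var(%rip), %rax            # 7 bytes: load var's value directly
# + 3 bytes of nop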
Consider this input TU:
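Judging by the symbol names in the outputs below, presumably something like (sketch):

// test.cc
extern int var;
void foo() { ++var; }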
With Clang (trunk) and -fPIE:
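Presumably along these lines (abridged sketch, directives omitted):

_Z3foov:
        movq var@GOTPCREL(%rip), %rax
        incl (%rax)
        retq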
Notice the movq var@GOTPCREL(%rip), %rax.
With GCC (13.2):
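Presumably (abridged sketch):

_Z3foov:
        addl $1, var(%rip)
        ret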
GCC correctly produces the more optimal addl $1, var(%rip).
It appears that Clang's -fPIE simply falls back to the -fPIC code sequence, which is also roughly what GCC produces with -fPIC. And by default Clang seems to produce similarly bad, GOT-indirect code.