I still very firmly disagree. If the linker rejects a valid ELF file because it's got itself in a twist and can't undo that, that's a linker bug.
I share Tim's opinion. However, even if the linker bug is fixed today (the fix seems to be in progress), that still won't help the next distributions to come, which will ship current linkers.
Would it be possible to do the hack in the compiler today (with a very big FIXME), mark this bug as a blocker for 3.7, and make sure we can re-enable the optimisation in the future once the linker is fixed?
If necessary, I can liaise with the Linaro GNU developers to backport that fix onto the next stable binutils, so that distros that use our toolchain could benefit without having to wait several months to get the proper fix.
Oh, and if you can't actually fix the linker, an error is strongly preferred to a silent codegen failure (rule #0 of compiler development).
Considering the nature of the linker's flow, compiler output has to match the TLS access patterns assumed by the linker.
I still very firmly disagree. If the linker rejects a valid ELF file because it's got itself in a twist and can't undo that, that's a linker bug.
Maybe the least bad solution is for the compiler to accommodate the linker, but let's be under no illusions here: the reason will be that the linker has a broken design that's more difficult to fix than the compiler. Not that it's better for everyone concerned.
In the linker's TLSDESC optimization, the linker uses register x0 because the result of the TLSDESC access sequence must end up in x0 (it is returned by the blr). Ideally, if the linker were able to put the resulting "movz; movk" instructions in the place of the original "blr", it could accommodate any access sequence. However, for the current aarch64 linker, it is almost impossible to squeeze 2 instructions into one slot without changing the linker's work flow.
The linker can check the TLS access sequence before doing TLS relaxation. It already does so for GD->LE relaxation, and we can add checks for TLSDESC too. However, the bad news is that when the linker finds the TLS access sequence is not right, it fails and exits, reporting an error. It cannot go back and do the relocation without TLS relaxation. The reason is that the linker allocates GOT and PLT space at a very early stage, which is where the TLS relaxation decision has to be made. By the time the linker does the real relocation work, it is too late to change the decision, since the symbol has no space in the GOT or PLT at all.
Clang generates different TLS access sequences than GCC. Before I was aware of this post, I had seen "adrp; ldr; add; blr", "adrp; add; adrp; ldr; blr", and "adrp; ldr; adrp; add; blr". The linker can accommodate them all, no matter which register the adrp uses.
IMHO, the compiler and linker should work together to achieve better performance as a whole. Considering the nature of the linker's flow, compiler output has to match the TLS access patterns assumed by the linker. If there are new patterns that make sense, we can add those patterns to the linker.
I think that scheduling issue is something we should push back extremely firmly about.
Actually, that was too hasty. The linker relaxation appears to be a very real optimisation too, possibly greater than the compiler's in some cases.
I think the correct way to word it would be a suggestion though ("compilers are advised to produce this exact sequence to aid linker optimisation; linkers that see this sequence may...").
There's still no excuse for not processing relocations correctly as seen. It's a buggy optimisation plain and simple. It's like us telling programmers to declare their variables volatile because "we've not yet fully determined what optimisations may be possible" and then blaming them when they get bad code for non-volatiles.
The linker should bail if the sequence isn't exactly as written (the data-flow is completely self-contained, so this is possible with a local analysis).
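To make the "local analysis" concrete, here is a rough sketch of the kind of check I mean. This is illustrative pseudo-BFD, not code that exists in binutils; the structure names and the exact encoding masks are my own:

    // Hypothetical sketch, not actual bfd/gold code: given the four TLSDESC
    // relocations of one access (ADR_PAGE21, LD64_LO12, ADD_LO12, CALL, in
    // that order), check that the instructions they point at are exactly the
    // documented sequence before committing to the relaxation.
    #include <cstdint>

    struct Reloc { uint64_t offset; };   // offset of the relocated instruction

    static uint32_t insn_at(const uint8_t *contents, uint64_t offset) {
      // Instructions are little-endian 32-bit words in the section contents.
      return (uint32_t)contents[offset] |
             ((uint32_t)contents[offset + 1] << 8) |
             ((uint32_t)contents[offset + 2] << 16) |
             ((uint32_t)contents[offset + 3] << 24);
    }

    // True if the sequence is:  adrp x0, ...; ldr xN, [x0, ...];
    //                           add x0, x0, ...; blr xN
    // (the ABI text allows any register for the intermediate xN).
    bool tlsdesc_sequence_is_canonical(const uint8_t *contents,
                                       const Reloc r[4]) {
      uint32_t adrp = insn_at(contents, r[0].offset);
      uint32_t ldr  = insn_at(contents, r[1].offset);
      uint32_t add  = insn_at(contents, r[2].offset);
      uint32_t blr  = insn_at(contents, r[3].offset);
      unsigned ldr_rt = ldr & 0x1f;                 // register holding the resolver
      return (adrp & 0x9f00001f) == 0x90000000      // adrp x0, <page>
          && (ldr  & 0xffc003e0) == 0xf9400000      // ldr  xN, [x0, #imm]
          && (add  & 0xffc003ff) == 0x91000000      // add  x0, x0, #imm
          && (blr  & 0xfffffc1f) == 0xd63f0000      // blr  xM
          && ((blr >> 5) & 0x1f) == ldr_rt;         // ...with M == N
    }

If a check like this fails, the linker would have to decide not to relax (and keep the GOT/TLSDESC entry), which per the explanation above means running it at the point where the relaxation decision is made, before GOT/PLT allocation, rather than at relocation time.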
(Of course, there's also GAS support if you care about such things).
It looks like that covers the extra relocations we emit (though Kristof has more recent experience with exhaustive testing of this).
Would that help?
I think that scheduling issue is something we should push back extremely firmly about.
First, it defeats the entire purpose of separate ELF relocations. If they have to be in a block, the specification should have just created an R_AARCH64_HORRIFIC_TLS_HACK covering the 5 instructions (or whatever) and used that.
Second, it prevents a real compiler optimisation, one implemented based on real, public specifications, in the name of some hypothetical future linker change that there aren't even plans for.
Basically, I think that document should never see the light of day.
FURTHER ANALYSIS:
One further reason why thread_local variables don't work on AArch64-linux for LLVM-generated code is that the current draft specification states:
"The compiler is not allowed to schedule the sequences below. This restriction is present because we have not yet fully determined what linker optimisations may be possible. In order to facilitate adding linker optimisations in the future, without recompiling current code, the compiler is restricted from scheduling these sequences."
In other words, current linkers are allowed to assume that accesses to TLS variables are done by one of the following exact sequences (I've only listed the sequences for the small code model with a 16MiB TLS area size and -mtls-dialect=desc):
Where ... appears in a code sequence the compiler may insert zero or more arbitrary instructions. All references to registers, other than x0 used as an argument or return value, are arbitrary and the compiler is free to use any register.
General Dynamic:
adrp x0, :tlsdesc:var R_AARCH64_TLSDESC_ADR_PAGE21 var
ldr x1, [x0, #:tlsdesc_lo12:var] R_AARCH64_TLSDESC_LD64_LO12 var
add x0, x0, #:tlsdesc_lo12:var R_AARCH64_TLSDESC_ADD_LO12 var
.tlsdesccall var
blr x1 R_AARCH64_TLSDESC_CALL var
...
mrs tp, tpidr_el0
add x0, tp, x0
or
mrs tp, tpidr_el0
ldr x0, [tp, x0]
Local Dynamic: See General Dynamic. "To avoid the definition of new relocations, the linker simply defines for all modules that have a TLS section a hidden per-module symbol called _TLS_MODULEBASE which denotes the beginning of the TLS section for that module."
Initial Exec:
0x00 mrs tp, tpidr_el0
...
0x04 adrp t0, :gottprel:x1 R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21 x1
0x08 ldr t0, [t0, #:gottprel_lo12:x1] R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC x1
...
0x0c add t0, tp
Local Exec:
0x00 mrs tp, tpidr_el0
...
0x20 add t0, tp, #:tprel_hi12:x1, lsl #12 R_AARCH64_TLSLE_ADD_TPREL_HI12 x2
0x24 add t0, #:tprel_lo12_nc:x1 R_AARCH64_TLSLE_ADD_TPREL_LO12_NC x2
(instead of the last add, the following can be used if the value of the variable,
rather than its address, is needed:
0x24 ldr t0, [t0, #:tprel_lo12_nc:x1] R_AARCH64_TLSLE_LDST64_TPREL_LO12_NC x2
)
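For reference, my reading of what BFD's TLSDESC relaxation does (inferred from the mis-relaxed output further below, so treat the exact rewriting as an assumption on my part) is that, when the TP offset is known at link time, the linker rewrites the descriptor sequence in place, one instruction per relocation, always targeting x0:

adrp x0, :tlsdesc:var              ->  movz x0, #<hi16 of TP offset>, lsl #16
ldr  x1, [x0, #:tlsdesc_lo12:var]  ->  movk x0, #<lo16 of TP offset>
add  x0, x0, #:tlsdesc_lo12:var    ->  nop
blr  x1                            ->  nop

This only leaves the right value in x0 if the four relocated instructions really are the ones listed above, which is exactly what breaks once the compiler has scheduled them.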
At the moment, LLVM schedules the instruction sequences produced, leading to mis-"relaxation" in the linker. An example, somewhat derived and simplified from how the thread_local variable in lib/Support/PrettyStackTrace.cpp is used, is:
C code:
extern _Thread_local void* t;
int main() {
  void* res = 0;
  for (int i = 0; i < 100; ++i)
    res += (int)&t;
  return (int)res;
}
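(For completeness: the exact invocation isn't recorded here; something along the lines of "clang --target=aarch64-linux-gnu -O2 -c t.c" followed by "objdump -rd t.o" should reproduce a listing like the one below, modulo exact offsets and register allocation.)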
objdump -rd of the clang-produced object file. The marked lines show that part of the TLS access code sequence has been hoisted out of the loop, with its value "cached" in register x9.
0000000000000000 <main>:
0: a9bf7bfd stp x29, x30, [sp,#-16]!
4: 910003fd mov x29, sp
8: aa1f03e8 mov x8, xzr
c: 52800c8a mov w10, #0x64 // #100
* 10: 90000009 adrp x9, 0 <t>
* 10: R_AARCH64_TLSDESC_ADR_PAGE21 t
* 14: 91000129 add x9, x9, #0x0
* 14: R_AARCH64_TLSDESC_ADD_LO12_NC t
18: 9000000b adrp x11, 0 <t>
18: R_AARCH64_TLSDESC_ADR_PAGE21 t
1c: f940016b ldr x11, [x11]
1c: R_AARCH64_TLSDESC_LD64_LO12_NC t
20: aa0903e0 mov x0, x9
24: d63f0160 blr x11
24: R_AARCH64_TLSDESC_CALL t
28: d53bd04c mrs x12, tpidr_el0
2c: 0b00018c add w12, w12, w0
30: 8b2cc108 add x8, x8, w12, sxtw
34: 5100054a sub w10, w10, #0x1
38: 35ffff4a cbnz w10, 20 <main+0x20>
3c: 2a0803e0 mov w0, w8
40: a8c17bfd ldp x29, x30, [sp],#16
44: d65f03c0 ret
ld then misoptimizes/mis-relaxes this code by assuming that register x0 is always used in the presence of these TLS relocations. That is probably a fair assumption if the instruction sequence has to remain unscheduled, but it is a broken assumption once a compiler optimizes/reschedules the sequence, which is what LLVM has done above.
0000000000400650 <main>:
400650: a9bf7bfd stp x29, x30, [sp,#-16]!
400654: 910003fd mov x29, sp
400658: aa1f03e8 mov x8, xzr
40065c: 52800c8a mov w10, #0x64 // #100
* 400660: d2a00000 movz x0, #0x0, lsl #16
* 400664: d503201f nop
400668: d2a00000 movz x0, #0x0, lsl #16
40066c: f2800200 movk x0, #0x10
400670: aa0903e0 mov x0, x9
400674: d503201f nop
400678: d53bd04c mrs x12, tpidr_el0
40067c: 0b00018c add w12, w12, w0
400680: 8b2cc108 add x8, x8, w12, sxtw
400684: 5100054a sub w10, w10, #0x1
400688: 35ffff4a cbnz w10, 400670 <main+0x20>
40068c: 2a0803e0 mov w0, w8
400690: a8c17bfd ldp x29, x30, [sp],#16
400694: d65f03c0 ret
Bottom line: the specification states that the instruction sequences must not be rescheduled, and the linkers make assumptions based on that.
This seems like more of a GNU bug: these relocations are in the ABI.
That said, we could probably work around it while improving LLVM at the same time. Instead of ripping out the existing code, we should make it the switchable option that was intended. Changing the default to match GCC wouldn't be unreasonable either.
IA-64 seems to have a similar issue and uses "-mtls-size" (not in LLVM, obviously). I was sure there was another platform with the issue too (PPC?), but can't find anything searching now. Maybe I'm just remembering -mpic vs -mPIC.
It seems that some of the TLS relocations that clang produces right now just aren't supported by ld.bfd, and some of them may not be supported by ld.gold either.
My understanding is that the GNU toolchain currently supports only the following code model with TLS: small PIC + TLS(16MiB).
It seems that the most practical solution for the time being (until the linkers available in the most common distributions support more TLS relocations) is to make sure that clang only produces what ld.bfd and ld.gold support on ELF-based systems.
In an attempt to see what gcc actually produces, and how clang's output differs, I've created and run the following short python script:

import os
f = open("t.c", "w")
f.write("""
extern _Thread_local int x;
static _Thread_local int y;
void f1(int **a) { *a = &x; }
void f2(int *a)  { *a = x; }
void f3(int **a) { *a = &y; }
void f4(int *a)  { *a = y; }
""")
f.close()
for fpicLabel, fpicOption in (("PIC", "-fPIC"), ("noPIC", "")):
    cmd = "clang-3.5 %(fpicOption)s -O3 -std=c11 t.c -c -S -o- > t.clang-3.5%(fpicLabel)s.s" % locals()
    os.system(cmd)
    for tlsDialect in ("trad", "desc"):
        cmd = "gcc %(fpicOption)s -O3 -std=c11 t.c -c -mtls-dialect=%(tlsDialect)s -S -o- > t.gcc491%(fpicLabel)s%(tlsDialect)s.s" % locals()
        os.system(cmd)
This produces the .s files in attachment.
Looking at the gcc- and clang-produced .s files, the main difference is in functions f3/f4, which access a static thread_local variable:
gcc produces:

f3:
        mrs     x1, tpidr_el0
        add     x1, x1, #:tprel_hi12:.LANCHOR0
        add     x1, x1, #:tprel_lo12_nc:.LANCHOR0
        str     x1, [x0]
clang produces:

f3:                                     // @f3
// BB#0:
        movz    x8, #:tprel_g1:y
        movk    x8, #:tprel_g0_nc:y
        mrs     x9, TPIDR_EL0
        add     x8, x9, x8
        str     x8, [x0]
My understanding is that the sequence produced by clang allows TLS areas of up to 4GiB, whereas the code produced by gcc allows TLS areas of up to 16MiB. The relocations needed for the movz/movk sequence are supported by neither gas nor ld.bfd.
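To spell out the arithmetic behind those limits (my summary, not part of the spec quote): :tprel_hi12: and :tprel_lo12_nc: each carry 12 bits, so the add/add (or add/ldr) pair reaches TP-relative offsets up to 2^24 bytes = 16MiB, while :tprel_g1: and :tprel_g0_nc: each carry 16 bits, so the movz/movk pair reaches 2^32 bytes = 4GiB.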
After just a very quick glance, it seems that the code generation in clang could be brought more in line with what the GNU linker and assembler accept by changing the following code in AArch64ISelLowering.cpp:
SDValue
AArch64TargetLowering::LowerELFGlobalTLSAddress(SDValue Op,
                                                SelectionDAG &DAG) const {
  assert(Subtarget->isTargetELF() && "This function expects an ELF target");
  assert(getTargetMachine().getCodeModel() == CodeModel::Small &&
         "ELF TLS only supported in small memory model");
  const GlobalAddressSDNode *GA = cast<GlobalAddressSDNode>(Op);

  TLSModel::Model Model = getTargetMachine().getTLSModel(GA->getGlobal());

  SDValue TPOff;
  EVT PtrVT = getPointerTy();
  SDLoc DL(Op);
  const GlobalValue *GV = GA->getGlobal();

  SDValue ThreadBase = DAG.getNode(AArch64ISD::THREAD_POINTER, DL, PtrVT);

  if (Model == TLSModel::LocalExec) {
    SDValue HiVar = DAG.getTargetGlobalAddress(
        GV, DL, PtrVT, 0, AArch64II::MO_TLS | AArch64II::MO_G1);
    SDValue LoVar = DAG.getTargetGlobalAddress(
        GV, DL, PtrVT, 0, AArch64II::MO_TLS | AArch64II::MO_G0 | AArch64II::MO_NC);

    TPOff = SDValue(DAG.getMachineNode(AArch64::MOVZXi, DL, PtrVT, HiVar,
                                       DAG.getTargetConstant(16, MVT::i32)),
                    0);
    TPOff = SDValue(DAG.getMachineNode(AArch64::MOVKXi, DL, PtrVT, TPOff, LoVar,
                                       DAG.getTargetConstant(0, MVT::i32)),
                    0);
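For what it's worth, here is a rough sketch of what a 16MiB-friendly LocalExec lowering might look like, i.e. emitting the add/add pair the GNU toolchain expects instead of movz/movk. The MO_HI12 flag and the way the lsl #12 ends up encoded are assumptions on my part, not something that exists in the tree today:

  if (Model == TLSModel::LocalExec) {
    // Aim for the ABI's small-TLS (<= 16MiB) sequence:
    //   add x0, tp, #:tprel_hi12:var, lsl #12
    //   add x0, x0, #:tprel_lo12_nc:var
    SDValue HiVar = DAG.getTargetGlobalAddress(
        GV, DL, PtrVT, 0, AArch64II::MO_TLS | AArch64II::MO_HI12);  // MO_HI12 is hypothetical
    SDValue LoVar = DAG.getTargetGlobalAddress(
        GV, DL, PtrVT, 0,
        AArch64II::MO_TLS | AArch64II::MO_PAGEOFF | AArch64II::MO_NC);

    TPOff = SDValue(DAG.getMachineNode(AArch64::ADDXri, DL, PtrVT, ThreadBase,
                                       HiVar,
                                       DAG.getTargetConstant(0, MVT::i32)),
                    0);
    TPOff = SDValue(DAG.getMachineNode(AArch64::ADDXri, DL, PtrVT, TPOff,
                                       LoVar,
                                       DAG.getTargetConstant(0, MVT::i32)),
                    0);
  }

That would also need matching assembler/streamer support for the :tprel_hi12:/:tprel_lo12_nc: operators, and the regression tests below would have to be updated to check for the new sequence.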
The relevant regression tests seem to be:

$ grep movz ../../../test/CodeGen/AArch64/*-tls-*
../../../test/CodeGen/AArch64/arm64-tls-dynamics.ll:; CHECK: movz [[DTP_OFFSET:x[0-9]+]], #:dtprel_g1:local_dynamic_var
../../../test/CodeGen/AArch64/arm64-tls-dynamics.ll:; CHECK: movz [[DTP_OFFSET:x[0-9]+]], #:dtprel_g1:local_dynamic_var
../../../test/CodeGen/AArch64/arm64-tls-execs.ll:; CHECK: movz [[TP_OFFSET:x[0-9]+]], #:tprel_g1:local_exec_var // encoding: [0bAAA{{[01]+}},A,0b101AAAAA,0x92]
../../../test/CodeGen/AArch64/arm64-tls-execs.ll:; CHECK: movz [[TP_OFFSET:x[0-9]+]], #:tprel_g1:local_exec_var
Chandler,
It seems that newer linkers may be able to bootstrap Clang, so this falls back on the issue of deprecating old tools, in this case the linker.
If we need those features, we should at least wait for the next release, and not back-port this into 3.6.
cheers, --renato
Extended Description
When building clang/llvm on aarch64-linux, using a clang compiler, the following error is produced:
/usr/bin/ld: lib/libLLVMSupport.a(PrettyStackTrace.cpp.o): unrecognized relocation (0x20c) in section `.text._ZN4llvm21PrettyStackTraceEntryC2Ev'
/usr/bin/ld: final link failed: Bad value
clang-3.7: error: linker command failed with exit code 1 (use -v to see invocation)
This is because llvm recently started making use of a thread_local variable in PrettyStackTrace.cpp, and the AArch64 backend in llvm produces TLS relocations that are not supported by gnu ld.bfd.