I still very firmly disagree. If the linker rejects a valid ELF file because it's got itself in a twist and can't undo that, that's a linker bug.
I share Tim's opinion. However, even if the linker bug is fixed today (the fix seems to be in progress), that still won't help the next distributions to come, which will ship current linkers.
Would it be possible to do the hack in the compiler today (with a very big FIXME), mark this bug as a blocker for 3.7, and make sure we can re-enable the optimisation in the future once the linker is fixed?
If necessary, I can liaise with the Linaro GNU developers to backport that fix onto the next stable binutils, so that distros that use our toolchain could benefit without having to wait several months to get the proper fix.
Oh, and if you can't actually fix the linker, an error is strongly preferred to a silent codegen failure (rule #0 of compiler development).
Considering the nature of the linker's flow, compiler output has to match the TLS access patterns assumed by the linker.
I still very firmly disagree. If the linker rejects a valid ELF file because it's got itself in a twist and can't undo that, that's a linker bug.
Maybe the least bad solution is for the compiler to accommodate the linker, but let's be under no illusions here: the reason will be that the linker has a broken design that's more difficult to fix than the compiler. Not that it's better for everyone concerned.
In the linker's TLSDESC optimization, the linker uses register x0 because the result of the TLSDESC access sequence must end up in x0 (it is returned by the blr). Ideally, if the linker were able to put the resulting "movz; movk" instructions in the place of the original "blr", it could accommodate any access sequence. However, for the current aarch64 linker, it is almost impossible to squeeze 2 instructions into one slot without changing the linker's work flow.
The linker can check the TLS access sequence before doing TLS relaxation. It already does so for GD->LE relaxation, and we can add checks for TLSDESC too. However, the bad news is that when the linker finds the TLS access sequence is not right, it fails and exits, reporting an error. It cannot go back and do the relocation without TLS relaxation. The reason is that the linker allocates GOT and PLT space at a very early stage, which is where the TLS relaxation decision has to be made. By the time the linker does the real relocation work, it is too late to change the decision, since the symbol has no space in the GOT or PLT at all.
Clang generates different TLS access sequences than GCC. Before I was aware of this post, I had seen "adrp; ldr; add; blr", "adrp; add; adrp; ldr; blr", and "adrp; ldr; adrp; add; blr". The linker can accommodate them all, no matter which register the adrp uses.
IMHO, the compiler and linker should work together to achieve better performance as a whole. Considering the nature of the linker's flow, compiler output has to match the TLS access patterns assumed by the linker. If there are new patterns that make sense, we can add those patterns to the linker.
I think that scheduling issue is something we should push back extremely firmly about.
Actually, that was too hasty. The linker relaxation appears to be a very real optimisation too, possibly greater than the compiler's in some cases.
I think the correct way to word it would be a suggestion though ("compilers are advised to produce this exact sequence to aid linker optimisation; linkers that see this sequence may...").
There's still no excuse for not processing relocations correctly as seen. It's a buggy optimisation plain and simple. It's like us telling programmers to declare their variables volatile because "we've not yet fully determined what optimisations may be possible" and then blaming them when they get bad code for non-volatiles.
The linker should bail if the sequence isn't exactly as written (the data-flow is completely self-contained, so this is possible with a local analysis).
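To make the "local analysis" concrete, here is a rough sketch of the kind of check I mean. This is illustrative pseudo-BFD, not code that exists in binutils; the structure names and the exact encoding masks are my own:

    // Hypothetical sketch, not actual bfd/gold code: given the four TLSDESC
    // relocations of one access (ADR_PAGE21, LD64_LO12, ADD_LO12, CALL, in
    // that order), check that the instructions they point at are exactly the
    // documented sequence before committing to the relaxation.
    #include <cstdint>

    struct Reloc { uint64_t offset; };   // offset of the relocated instruction

    static uint32_t insn_at(const uint8_t *contents, uint64_t offset) {
      // Instructions are little-endian 32-bit words in the section contents.
      return (uint32_t)contents[offset] |
             ((uint32_t)contents[offset + 1] << 8) |
             ((uint32_t)contents[offset + 2] << 16) |
             ((uint32_t)contents[offset + 3] << 24);
    }

    // True if the sequence is:  adrp x0, ...; ldr xN, [x0, ...];
    //                           add x0, x0, ...; blr xN
    // (the ABI text allows any register for the intermediate xN).
    bool tlsdesc_sequence_is_canonical(const uint8_t *contents,
                                       const Reloc r[4]) {
      uint32_t adrp = insn_at(contents, r[0].offset);
      uint32_t ldr  = insn_at(contents, r[1].offset);
      uint32_t add  = insn_at(contents, r[2].offset);
      uint32_t blr  = insn_at(contents, r[3].offset);
      unsigned ldr_rt = ldr & 0x1f;                 // register holding the resolver
      return (adrp & 0x9f00001f) == 0x90000000      // adrp x0, <page>
          && (ldr  & 0xffc003e0) == 0xf9400000      // ldr  xN, [x0, #imm]
          && (add  & 0xffc003ff) == 0x91000000      // add  x0, x0, #imm
          && (blr  & 0xfffffc1f) == 0xd63f0000      // blr  xM
          && ((blr >> 5) & 0x1f) == ldr_rt;         // ...with M == N
    }

If a check like this fails, the linker would have to decide not to relax (and keep the GOT/TLSDESC entry), which per the explanation above means running it at the point where the relaxation decision is made, before GOT/PLT allocation, rather than at relocation time.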
(Of course, there's also GAS support if you care about such things).
It looks like that covers the extra relocations we emit (though Kristof has more recent experience with exhaustive testing of this).
Would that help?
I think that scheduling issue is something we should push back extremely firmly about.
First, it defeats the entire purpose of separate ELF relocations. If they have to be in a block, the specification should have just created an R_AARCH64_HORRIFIC_TLS_HACK covering the 5 instructions (or whatever) and used that.
Second, it prevents a real compiler optimisation, one implemented based on real, public specifications, in the name of some hypothetical future linker change that there aren't even plans for.
Basically, I think that document should never see the light of day.
FURTHER ANALYSIS:
One further reason why thread_local variables don't work on AArch64-linux for LLVM-generated code is that the current draft specification states:
"The compiler is not allowed to schedule the sequences below. This restriction is present because we have not yet fully determined what linker optimisations may be possible. In order to facilitate adding linker optimisations in the future, without recompiling current code, the compiler is restricted from scheduling these sequences."
In other words, current linkers are allowed to assume that accesses to TLS variables are done by one of the following exact sequences (I've only listed the sequences for the small code model with a 16MiB TLS area size and -mtls-dialect=desc):
Where ... appears in a code sequence the compiler may insert zero or more arbitrary instructions. All references to registers, other than x0 used as an argument or return value, are arbitrary and the compiler is free to use any register.
General Dynamic:
adrp x0, :tlsdesc:var R_AARCH64_TLSDESC_ADR_PAGE21 var
ldr x1, [x0, #:tlsdesc_lo12:var] R_AARCH64_TLSDESC_LD64_LO12 var
add x0, x0, #:tlsdesc_lo12:var R_AARCH64_TLSDESC_ADD_LO12 var
.tlsdesccall var
blr x1 R_AARCH64_TLSDESC_CALL var
...
mrs tp, tpidr_el0
add x0, tp, x0
or
mrs tp, tpidr_el0
ldr x0, [tp, x0]
Local Dynamic: See General Dynamic. "To avoid the definition of new relocations, the linker simply defines for all modules that have a TLS section a hidden per-module symbol called _TLS_MODULEBASE which denotes the beginning of the TLS section for that module."
Initial Exec:
0x00 mrs tp, tpidr_el0
...
0x04 adrp t0, :gottprel:x1 R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21 x1
0x08 ldr t0, [t0, #:gottprel_lo12:x1] R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC x1
...
0x0c add t0, tp
Local Exec:
0x00 mrs tp, tpidr_el0
...
0x20 add t0, tp, #:tprel_hi12:x1, lsl #12 R_AARCH64_TLSLE_ADD_TPREL_HI12 x2
0x24 add t0, #:tprel_lo12_nc:x1 R_AARCH64_TLSLE_ADD_TPREL_LO12_NC x2
(instead of the last add, the following can be used if the value of the variable,
rather than its address, is needed:
0x24 ldr t0, [t0, #:tprel_lo12_nc:x1] R_AARCH64_TLSLE_LDST64_TPREL_LO12_NC x2
)
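For reference, my reading of what BFD's TLSDESC relaxation does (inferred from the mis-relaxed output further below, so treat the exact rewriting as an assumption on my part) is that, when the TP offset is known at link time, the linker rewrites the descriptor sequence in place, one instruction per relocation, always targeting x0:

adrp x0, :tlsdesc:var              ->  movz x0, #<hi16 of TP offset>, lsl #16
ldr  x1, [x0, #:tlsdesc_lo12:var]  ->  movk x0, #<lo16 of TP offset>
add  x0, x0, #:tlsdesc_lo12:var    ->  nop
blr  x1                            ->  nop

This only leaves the right value in x0 if the four relocated instructions really are the ones listed above, which is exactly what breaks once the compiler has scheduled them.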
At the moment, LLVM schedules the instruction sequences produced, leading to mis-"relaxation" in the linker. An example, somewhat derived and simplified from how the thread_local variable in lib/Support/PrettyStackTrace.cpp is used, is:
C code:
extern _Thread_local void* t;
int main() {
  void* res = 0;
  for (int i = 0; i < 100; ++i)
    res += (int)&t;
  return (int)res;
}
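(For completeness: the exact invocation isn't recorded here; something along the lines of "clang --target=aarch64-linux-gnu -O2 -c t.c" followed by "objdump -rd t.o" should reproduce a listing like the one below, modulo exact offsets and register allocation.)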
objdump -rd of the clang-produced object file. The marked lines show that part of the TLS access code sequence has been hoisted out of the loop, with its value "cached" in register x9.
0000000000000000 <main>:
0: a9bf7bfd stp x29, x30, [sp,#-16]!
4: 910003fd mov x29, sp
8: aa1f03e8 mov x8, xzr
c: 52800c8a mov w10, #0x64 // #100
* 10: 90000009 adrp x9, 0 <t>
* 10: R_AARCH64_TLSDESC_ADR_PAGE21 t
* 14: 91000129 add x9, x9, #0x0
* 14: R_AARCH64_TLSDESC_ADD_LO12_NC t
18: 9000000b adrp x11, 0 <t>
18: R_AARCH64_TLSDESC_ADR_PAGE21 t
1c: f940016b ldr x11, [x11]
1c: R_AARCH64_TLSDESC_LD64_LO12_NC t
20: aa0903e0 mov x0, x9
24: d63f0160 blr x11
24: R_AARCH64_TLSDESC_CALL t
28: d53bd04c mrs x12, tpidr_el0
2c: 0b00018c add w12, w12, w0
30: 8b2cc108 add x8, x8, w12, sxtw
34: 5100054a sub w10, w10, #0x1
38: 35ffff4a cbnz w10, 20 <main+0x20>
3c: 2a0803e0 mov w0, w8
40: a8c17bfd ldp x29, x30, [sp],#16
44: d65f03c0 ret
ld then misoptimizes/mis-relaxes this code by assuming that register x0 is always used in the presence of these TLS relocations. That is probably a fair assumption if the instruction sequence has to remain unscheduled, but it is a broken assumption once a compiler optimizes/reschedules the sequence, which is what LLVM has done above.
0000000000400650 <main>:
400650: a9bf7bfd stp x29, x30, [sp,#-16]!
400654: 910003fd mov x29, sp
400658: aa1f03e8 mov x8, xzr
40065c: 52800c8a mov w10, #0x64 // #100
* 400660: d2a00000 movz x0, #0x0, lsl #16
* 400664: d503201f nop
400668: d2a00000 movz x0, #0x0, lsl #16
40066c: f2800200 movk x0, #0x10
400670: aa0903e0 mov x0, x9
400674: d503201f nop
400678: d53bd04c mrs x12, tpidr_el0
40067c: 0b00018c add w12, w12, w0
400680: 8b2cc108 add x8, x8, w12, sxtw
400684: 5100054a sub w10, w10, #0x1
400688: 35ffff4a cbnz w10, 400670 <main+0x20>
40068c: 2a0803e0 mov w0, w8
400690: a8c17bfd ldp x29, x30, [sp],#16
400694: d65f03c0 ret
Bottom line: the specification states that the instruction sequences must not be rescheduled, and the linkers make assumptions based on that.
This seems like more of a GNU bug: these relocations are in the ABI.
That said, we could probably work around it while improving LLVM at the same time. Instead of ripping out the existing code, we should make it the switchable option that was intended. Changing the default to match GCC wouldn't be unreasonable either.
IA-64 seems to have a similar issue and uses "-mtls-size" (not in LLVM, obviously). I was sure there was another platform with the issue too (PPC?), but can't find anything searching now. Maybe I'm just remembering -mpic vs -mPIC.
It seems that some of the TLS relocations that clang produces right now just aren't supported by ld.bfd, and some of them may not be supported by ld.gold either.
My understanding is that the GNU toolchain currently supports only the following code model with TLS: small PIC + TLS(16MiB).
It seems that the most practical solution for the time being (until the linkers available in the most common distributions support more TLS relocations) is to make sure that clang only produces what ld.bfd and ld.gold support on ELF-based systems.
In an attempt to see what gcc actually produces, and how clang's output differs, I've created and run the following short python script:

import os
f = open("t.c", "w")
f.write("""
extern _Thread_local int x;
static _Thread_local int y;
void f1(int **a) { *a = &x; }
void f2(int *a)  { *a = x; }
void f3(int **a) { *a = &y; }
void f4(int *a)  { *a = y; }
""")
f.close()
for fpicLabel, fpicOption in (("PIC", "-fPIC"), ("noPIC", "")):
    cmd = "clang-3.5 %(fpicOption)s -O3 -std=c11 t.c -c -S -o- > t.clang-3.5%(fpicLabel)s.s" % locals()
    os.system(cmd)
    for tlsDialect in ("trad", "desc"):
        cmd = "gcc %(fpicOption)s -O3 -std=c11 t.c -c -mtls-dialect=%(tlsDialect)s -S -o- > t.gcc491%(fpicLabel)s%(tlsDialect)s.s" % locals()
        os.system(cmd)
This produces the .s files in attachment.
Looking at the gcc- and clang-produced .s files, the main difference is in functions f3/f4, which access a static thread_local variable:
gcc produces:

f3:
        mrs     x1, tpidr_el0
        add     x1, x1, #:tprel_hi12:.LANCHOR0
        add     x1, x1, #:tprel_lo12_nc:.LANCHOR0
        str     x1, [x0]
clang produces:

f3:                                     // @f3
// BB#0:
        movz    x8, #:tprel_g1:y
        movk    x8, #:tprel_g0_nc:y
        mrs     x9, TPIDR_EL0
        add     x8, x9, x8
        str     x8, [x0]
My understanding is that the sequence produced by clang allows TLS areas of up to 4GiB, whereas the code produced by gcc allows TLS areas of up to 16MiB. The relocations needed for the movz/movk sequence are supported by neither gas nor ld.bfd.
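To spell out the arithmetic behind those limits (my summary, not part of the spec quote): :tprel_hi12: and :tprel_lo12_nc: each carry 12 bits, so the add/add (or add/ldr) pair reaches TP-relative offsets up to 2^24 bytes = 16MiB, while :tprel_g1: and :tprel_g0_nc: each carry 16 bits, so the movz/movk pair reaches 2^32 bytes = 4GiB.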
After just a very quick glance, it seems that the code generation in clang could be brought more in line with what the GNU linker and assembler accept by changing the following code in AArch64ISelLowering.cpp:
SDValue
AArch64TargetLowering::LowerELFGlobalTLSAddress(SDValue Op,
                                                SelectionDAG &DAG) const {
  assert(Subtarget->isTargetELF() && "This function expects an ELF target");
  assert(getTargetMachine().getCodeModel() == CodeModel::Small &&
         "ELF TLS only supported in small memory model");
  const GlobalAddressSDNode *GA = cast<GlobalAddressSDNode>(Op);

  TLSModel::Model Model = getTargetMachine().getTLSModel(GA->getGlobal());

  SDValue TPOff;
  EVT PtrVT = getPointerTy();
  SDLoc DL(Op);
  const GlobalValue *GV = GA->getGlobal();

  SDValue ThreadBase = DAG.getNode(AArch64ISD::THREAD_POINTER, DL, PtrVT);

  if (Model == TLSModel::LocalExec) {
    SDValue HiVar = DAG.getTargetGlobalAddress(
        GV, DL, PtrVT, 0, AArch64II::MO_TLS | AArch64II::MO_G1);
    SDValue LoVar = DAG.getTargetGlobalAddress(
        GV, DL, PtrVT, 0, AArch64II::MO_TLS | AArch64II::MO_G0 | AArch64II::MO_NC);

    TPOff = SDValue(DAG.getMachineNode(AArch64::MOVZXi, DL, PtrVT, HiVar,
                                       DAG.getTargetConstant(16, MVT::i32)),
                    0);
    TPOff = SDValue(DAG.getMachineNode(AArch64::MOVKXi, DL, PtrVT, TPOff, LoVar,
                                       DAG.getTargetConstant(0, MVT::i32)),
                    0);
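For what it's worth, here is a rough sketch of what a 16MiB-friendly LocalExec lowering might look like, i.e. emitting the add/add pair the GNU toolchain expects instead of movz/movk. The MO_HI12 flag and the way the lsl #12 ends up encoded are assumptions on my part, not something that exists in the tree today:

  if (Model == TLSModel::LocalExec) {
    // Aim for the ABI's small-TLS (<= 16MiB) sequence:
    //   add x0, tp, #:tprel_hi12:var, lsl #12
    //   add x0, x0, #:tprel_lo12_nc:var
    SDValue HiVar = DAG.getTargetGlobalAddress(
        GV, DL, PtrVT, 0, AArch64II::MO_TLS | AArch64II::MO_HI12);  // MO_HI12 is hypothetical
    SDValue LoVar = DAG.getTargetGlobalAddress(
        GV, DL, PtrVT, 0,
        AArch64II::MO_TLS | AArch64II::MO_PAGEOFF | AArch64II::MO_NC);

    TPOff = SDValue(DAG.getMachineNode(AArch64::ADDXri, DL, PtrVT, ThreadBase,
                                       HiVar,
                                       DAG.getTargetConstant(0, MVT::i32)),
                    0);
    TPOff = SDValue(DAG.getMachineNode(AArch64::ADDXri, DL, PtrVT, TPOff,
                                       LoVar,
                                       DAG.getTargetConstant(0, MVT::i32)),
                    0);
  }

That would also need matching assembler/streamer support for the :tprel_hi12:/:tprel_lo12_nc: operators, and the regression tests below would have to be updated to check for the new sequence.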
The relevant regression tests seem to be:

$ grep movz ../../../test/CodeGen/AArch64/*-tls-*
../../../test/CodeGen/AArch64/arm64-tls-dynamics.ll:; CHECK: movz [[DTP_OFFSET:x[0-9]+]], #:dtprel_g1:local_dynamic_var
../../../test/CodeGen/AArch64/arm64-tls-dynamics.ll:; CHECK: movz [[DTP_OFFSET:x[0-9]+]], #:dtprel_g1:local_dynamic_var
../../../test/CodeGen/AArch64/arm64-tls-execs.ll:; CHECK: movz [[TP_OFFSET:x[0-9]+]], #:tprel_g1:local_exec_var // encoding: [0bAAA{{[01]+}},A,0b101AAAAA,0x92]
../../../test/CodeGen/AArch64/arm64-tls-execs.ll:; CHECK: movz [[TP_OFFSET:x[0-9]+]], #:tprel_g1:local_exec_var
Chandler,
It seems that newer linkers may be able to bootstrap Clang, so this falls back on the issue of deprecating old tools, in this case the linker.
If we need those features, we should at least wait for the next release, and not back-port this into 3.6.
cheers, --renato
Extended Description
When building clang/llvm on aarch64-linux, using a clang compiler, the following error is produced:
/usr/bin/ld: lib/libLLVMSupport.a(PrettyStackTrace.cpp.o): unrecognized relocation (0x20c) in section `.text._ZN4llvm21PrettyStackTraceEntryC2Ev'
/usr/bin/ld: final link failed: Bad value
clang-3.7: error: linker command failed with exit code 1 (use -v to see invocation)
This is because llvm recently started making use of a thread_local variable in PrettyStackTrace.cpp, and the AArch64 backend in llvm produces TLS relocations that are not supported by gnu ld.bfd.