llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.83k stars 11.91k forks source link

Addresses of TLS variables are kept alive across fiber/stack-full coroutine context switches which may result in a crash #98479

Open omern1 opened 3 months ago

omern1 commented 3 months ago

; llc --relocation-model=pic ./tls.ll

@tls_foo = hidden thread_local global ptr null, align 8

; Function Attrs: mustprogress nounwind sspstrong uwtable
define hidden noundef i32 @_Z4testv() #0 {
entry:
  %0 = call align 8 ptr @llvm.threadlocal.address.p0(ptr align 8 @tls_foo)
  %1 = load ptr, ptr %0, align 8
  call void @_Z8do_stuffPv(ptr noundef %1)
  %2 = call align 8 ptr @llvm.threadlocal.address.p0(ptr align 8 @tls_foo)
  %3 = load ptr, ptr %2, align 8
  call void @_Z8do_stuffPv(ptr noundef %3)
  ret i32 0
}

declare void @_Z8do_stuffPv(ptr noundef) #1

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare nonnull ptr @llvm.threadlocal.address.p0(ptr nonnull)

Output:

_Z4testv:                               # @_Z4testv
        .cfi_startproc
# %bb.0:                                # %entry
        pushq   %rbx
        .cfi_def_cfa_offset 16
        .cfi_offset %rbx, -16
        leaq    tls_foo@TLSLD(%rip), %rdi
        callq   __tls_get_addr@PLT
        movq    %rax, %rbx
        movq    tls_foo@DTPOFF(%rax), %rdi
        callq   _Z8do_stuffPv@PLT
; The address of tls_foo should be reloaded here.
        movq    tls_foo@DTPOFF(%rbx), %rdi
        callq   _Z8do_stuffPv@PLT
        xorl    %eax, %eax
        popq    %rbx
        .cfi_def_cfa_offset 8
        retq

In our (Sony's) private repro test is executing on a fiber created by an M:N fiber library (stack-full coroutines) and do_stuff causes the fiber to be suspended. When test resumes execution it's likely that it won't be running on the same OS thread as it was before it was suspended. This means that the TLS address loaded won't be valid on the new thread and will cause a crash.

MSVC has the /GT commandline option which makes it reload the TLS address before each load of a TLS variable and I think we probably need something similar in LLVM.

llvmbot commented 3 months ago

@llvm/issue-subscribers-coroutines

Author: Nabeel Omer (omern1)

``` c++ // Compile with clang -O3 -S -fpic ./tls.cpp thread_local void* tls_foo; extern void do_stuff(void*); int test() { do_stuff(tls_foo); do_stuff(tls_foo); return 0; } ``` Output: ``` _Z4testv: # @_Z4testv .cfi_startproc # %bb.0: # %entry pushq %rbx .cfi_def_cfa_offset 16 .cfi_offset %rbx, -16 data16 leaq tls_foo@TLSGD(%rip), %rdi data16 data16 rex64 callq __tls_get_addr@PLT movq %rax, %rbx movq (%rax), %rdi callq _Z8do_stuffPv@PLT movq (%rbx), %rdi callq _Z8do_stuffPv@PLT xorl %eax, %eax popq %rbx .cfi_def_cfa_offset 8 retq ``` In our (Sony's) private repro `test` is executing on a fiber created by an M:N fiber library (stack-full coroutines) and `do_stuff` causes the fiber to be suspended. When `test` resumes execution it's likely that it won't be running on the same OS thread as it was before it was suspended. This means that the TLS address loaded won't be valid on the new thread and will cause a crash. [MSVC has the `/GT` commandline option](https://learn.microsoft.com/en-us/cpp/build/reference/gt-support-fiber-safe-thread-local-storage) which makes it reload the TLS address before each load of a TLS variable and I think we probably need something similar in LLVM.
omern1 commented 3 months ago

Also adding the SelectionDAG label because the "de-duplication" of the TLS address load happens there.

=== _Z4testv

Initial selection DAG: %bb.0 '_Z4testv:entry'
SelectionDAG has 24 nodes:
  t2: i64 = Constant<0>
    t0: ch,glue = EntryToken
  t4: i64,ch = load<(dereferenceable load (s64) from %ir.0)> t0, GlobalTLSAddress:i64<ptr @tls_foo> 0, undef:i64
  t5: i64 = GlobalAddress<ptr @_Z8do_stuffPv> 0
    t7: ch,glue = callseq_start t4:1, TargetConstant:i64<0>, TargetConstant:i64<0>
  t9: ch,glue = CopyToReg t7, Register:i64 $rdi, t4
  t12: ch,glue = X86ISD::CALL t9, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t9:1
    t13: ch,glue = callseq_end t12, TargetConstant:i64<0>, TargetConstant:i64<0>, t12:1
  t14: i64,ch = load<(dereferenceable load (s64) from %ir.2)> t13, GlobalTLSAddress:i64<ptr @tls_foo> 0, undef:i64
    t15: ch,glue = callseq_start t14:1, TargetConstant:i64<0>, TargetConstant:i64<0>
  t16: ch,glue = CopyToReg t15, Register:i64 $rdi, t14
  t17: ch,glue = X86ISD::CALL t16, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t16:1
    t18: ch,glue = callseq_end t17, TargetConstant:i64<0>, TargetConstant:i64<0>, t17:1
  t22: ch,glue = CopyToReg t18, Register:i32 $eax, Constant:i32<0>
  t23: ch = X86ISD::RET_GLUE t22, TargetConstant:i32<0>, Register:i32 $eax, t22:1

Optimized lowered selection DAG: %bb.0 '_Z4testv:entry'
SelectionDAG has 22 nodes:
    t0: ch,glue = EntryToken
  t4: i64,ch = load<(dereferenceable load (s64) from %ir.0)> t0, GlobalTLSAddress:i64<ptr @tls_foo> 0, undef:i64
    t7: ch,glue = callseq_start t4:1, TargetConstant:i64<0>, TargetConstant:i64<0>
  t9: ch,glue = CopyToReg t7, Register:i64 $rdi, t4
  t12: ch,glue = X86ISD::CALL t9, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t9:1
    t13: ch,glue = callseq_end t12, TargetConstant:i64<0>, TargetConstant:i64<0>, t12:1
  t14: i64,ch = load<(dereferenceable load (s64) from %ir.2)> t13, GlobalTLSAddress:i64<ptr @tls_foo> 0, undef:i64
    t15: ch,glue = callseq_start t14:1, TargetConstant:i64<0>, TargetConstant:i64<0>
  t16: ch,glue = CopyToReg t15, Register:i64 $rdi, t14
  t17: ch,glue = X86ISD::CALL t16, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t16:1
    t18: ch,glue = callseq_end t17, TargetConstant:i64<0>, TargetConstant:i64<0>, t17:1
  t22: ch,glue = CopyToReg t18, Register:i32 $eax, Constant:i32<0>
  t23: ch = X86ISD::RET_GLUE t22, TargetConstant:i32<0>, Register:i32 $eax, t22:1

Type-legalized selection DAG: %bb.0 '_Z4testv:entry'
SelectionDAG has 22 nodes:
    t0: ch,glue = EntryToken
  t4: i64,ch = load<(dereferenceable load (s64) from %ir.0)> t0, GlobalTLSAddress:i64<ptr @tls_foo> 0, undef:i64
    t7: ch,glue = callseq_start t4:1, TargetConstant:i64<0>, TargetConstant:i64<0>
  t9: ch,glue = CopyToReg t7, Register:i64 $rdi, t4
  t12: ch,glue = X86ISD::CALL t9, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t9:1
    t13: ch,glue = callseq_end t12, TargetConstant:i64<0>, TargetConstant:i64<0>, t12:1
  t14: i64,ch = load<(dereferenceable load (s64) from %ir.2)> t13, GlobalTLSAddress:i64<ptr @tls_foo> 0, undef:i64
    t15: ch,glue = callseq_start t14:1, TargetConstant:i64<0>, TargetConstant:i64<0>
  t16: ch,glue = CopyToReg t15, Register:i64 $rdi, t14
  t17: ch,glue = X86ISD::CALL t16, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t16:1
    t18: ch,glue = callseq_end t17, TargetConstant:i64<0>, TargetConstant:i64<0>, t17:1
  t22: ch,glue = CopyToReg t18, Register:i32 $eax, Constant:i32<0>
  t23: ch = X86ISD::RET_GLUE t22, TargetConstant:i32<0>, Register:i32 $eax, t22:1

Legalized selection DAG: %bb.0 '_Z4testv:entry'
SelectionDAG has 28 nodes:
  t0: ch,glue = EntryToken
  t4: i64,ch = load<(dereferenceable load (s64) from %ir.0)> t0, t30, undef:i64
    t7: ch,glue = callseq_start t4:1, TargetConstant:i64<0>, TargetConstant:i64<0>
  t9: ch,glue = CopyToReg t7, Register:i64 $rdi, t4
  t12: ch,glue = X86ISD::CALL t9, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t9:1
    t13: ch,glue = callseq_end t12, TargetConstant:i64<0>, TargetConstant:i64<0>, t12:1
  t14: i64,ch = load<(dereferenceable load (s64) from %ir.2)> t13, t30, undef:i64
    t15: ch,glue = callseq_start t14:1, TargetConstant:i64<0>, TargetConstant:i64<0>
  t16: ch,glue = CopyToReg t15, Register:i64 $rdi, t14
  t17: ch,glue = X86ISD::CALL t16, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t16:1
    t18: ch,glue = callseq_end t17, TargetConstant:i64<0>, TargetConstant:i64<0>, t17:1
  t22: ch,glue = CopyToReg t18, Register:i32 $eax, Constant:i32<0>
  t25: ch,glue = X86ISD::TLSBASEADDR t0, TargetGlobalTLSAddress:i64<ptr @tls_foo> 0 [TF=9]
    t29: i64 = X86ISD::Wrapper TargetGlobalTLSAddress:i64<ptr @tls_foo> 0 [TF=14]
    t27: i64,ch,glue = CopyFromReg t25, Register:i64 $rax, t25:1
  t30: i64 = add t29, t27
  t23: ch = X86ISD::RET_GLUE t22, TargetConstant:i32<0>, Register:i32 $eax, t22:1

Optimized legalized selection DAG: %bb.0 '_Z4testv:entry'
SelectionDAG has 28 nodes:
  t0: ch,glue = EntryToken
  t4: i64,ch = load<(dereferenceable load (s64) from %ir.0)> t0, t30, undef:i64
    t7: ch,glue = callseq_start t4:1, TargetConstant:i64<0>, TargetConstant:i64<0>
  t9: ch,glue = CopyToReg t7, Register:i64 $rdi, t4
  t12: ch,glue = X86ISD::CALL t9, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t9:1
    t13: ch,glue = callseq_end t12, TargetConstant:i64<0>, TargetConstant:i64<0>, t12:1
  t14: i64,ch = load<(dereferenceable load (s64) from %ir.2)> t13, t30, undef:i64
    t15: ch,glue = callseq_start t14:1, TargetConstant:i64<0>, TargetConstant:i64<0>
  t16: ch,glue = CopyToReg t15, Register:i64 $rdi, t14
  t17: ch,glue = X86ISD::CALL t16, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t16:1
    t18: ch,glue = callseq_end t17, TargetConstant:i64<0>, TargetConstant:i64<0>, t17:1
  t22: ch,glue = CopyToReg t18, Register:i32 $eax, Constant:i32<0>
  t25: ch,glue = X86ISD::TLSBASEADDR t0, TargetGlobalTLSAddress:i64<ptr @tls_foo> 0 [TF=9]
    t29: i64 = X86ISD::Wrapper TargetGlobalTLSAddress:i64<ptr @tls_foo> 0 [TF=14]
    t27: i64,ch,glue = CopyFromReg t25, Register:i64 $rax, t25:1
  t30: i64 = add t29, t27
  t23: ch = X86ISD::RET_GLUE t22, TargetConstant:i32<0>, Register:i32 $eax, t22:1

Selected selection DAG: %bb.0 '_Z4testv:entry'
SelectionDAG has 28 nodes:
  t0: ch,glue = EntryToken
  t25: i64,ch,glue = TLS_base_addr64 Register:i64 $noreg, TargetConstant:i8<1>, Register:i64 $noreg, TargetGlobalTLSAddress:i32<ptr @tls_foo> 0 [TF=9], Register:i16 $noreg, t0
  t27: i64,ch,glue = CopyFromReg t25:1, Register:i64 $rax, t25:2
  t4: i64,ch = MOV64rm<Mem:(dereferenceable load (s64) from %ir.0)> t27, TargetConstant:i8<1>, Register:i64 $noreg, TargetGlobalTLSAddress:i32<ptr @tls_foo> 0 [TF=14], Register:i16 $noreg, t0
    t7: i64,ch,glue = ADJCALLSTACKDOWN64 TargetConstant:i64<0>, TargetConstant:i64<0>, TargetConstant:i32<0>, t4:1
  t9: ch,glue = CopyToReg t7:1, Register:i64 $rdi, t4
  t12: ch,glue = CALL64pcrel32 TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t9, t9:1
    t13: i64,ch,glue = ADJCALLSTACKUP64 TargetConstant:i64<0>, TargetConstant:i64<0>, t12, t12:1
  t14: i64,ch = MOV64rm<Mem:(dereferenceable load (s64) from %ir.2)> t27, TargetConstant:i8<1>, Register:i64 $noreg, TargetGlobalTLSAddress:i32<ptr @tls_foo> 0 [TF=14], Register:i16 $noreg, t13:1
    t15: i64,ch,glue = ADJCALLSTACKDOWN64 TargetConstant:i64<0>, TargetConstant:i64<0>, TargetConstant:i32<0>, t14:1
  t16: ch,glue = CopyToReg t15:1, Register:i64 $rdi, t14
  t17: ch,glue = CALL64pcrel32 TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t16, t16:1
    t18: i64,ch,glue = ADJCALLSTACKUP64 TargetConstant:i64<0>, TargetConstant:i64<0>, t17, t17:1
  t22: ch,glue = CopyToReg t18:1, Register:i32 $eax, MOV32r0:i32,i32
  t23: ch = RET TargetConstant:i32<0>, Register:i32 $eax, t22, t22:1

*** MachineFunction at end of ISel ***
# Machine code for function _Z4testv: IsSSA, TracksLiveness

bb.0.entry:
  TLS_base_addr64 $noreg, 1, $noreg, target-flags(x86-tlsld) @tls_foo, $noreg, implicit-def $rax, implicit-def dead $rcx, implicit-def dead $rdx, implicit-def dead $rsi, implicit-def dead $rdi, implicit-def dead $r8, implicit-def dead $r9, implicit-def dead $r10, implicit-def dead $r11, implicit-def dead $fp0, implicit-def dead $fp1, implicit-def dead $fp2, implicit-def dead $fp3, implicit-def dead $fp4, implicit-def dead $fp5, implicit-def dead $fp6, implicit-def dead $fp7, implicit-def dead $st0, implicit-def dead $st1, implicit-def dead $st2, implicit-def dead $st3, implicit-def dead $st4, implicit-def dead $st5, implicit-def dead $st6, implicit-def dead $st7, implicit-def dead $mm0, implicit-def dead $mm1, implicit-def dead $mm2, implicit-def dead $mm3, implicit-def dead $mm4, implicit-def dead $mm5, implicit-def dead $mm6, implicit-def dead $mm7, implicit-def dead $xmm0, implicit-def dead $xmm1, implicit-def dead $xmm2, implicit-def dead $xmm3, implicit-def dead $xmm4, implicit-def dead $xmm5, implicit-def dead $xmm6, implicit-def dead $xmm7, implicit-def dead $xmm8, implicit-def dead $xmm9, implicit-def dead $xmm10, implicit-def dead $xmm11, implicit-def dead $xmm12, implicit-def dead $xmm13, implicit-def dead $xmm14, implicit-def dead $xmm15, implicit-def dead $eflags, implicit-def dead $df, implicit $rsp, implicit $ssp
  %0:gr64 = COPY $rax
  %1:gr64 = MOV64rm %0:gr64, 1, $noreg, target-flags(x86-dtpoff) @tls_foo, $noreg :: (dereferenceable load (s64) from %ir.0)
  ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
  $rdi = COPY %1:gr64
  CALL64pcrel32 target-flags(x86-plt) @_Z8do_stuffPv, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $rdi, implicit-def $rsp, implicit-def $ssp
  ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
  %2:gr64 = MOV64rm %0:gr64, 1, $noreg, target-flags(x86-dtpoff) @tls_foo, $noreg :: (dereferenceable load (s64) from %ir.2)
  ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
  $rdi = COPY %2:gr64
  CALL64pcrel32 target-flags(x86-plt) @_Z8do_stuffPv, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $rdi, implicit-def $rsp, implicit-def $ssp
  ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
  %3:gr32 = MOV32r0 implicit-def dead $eflags
  $eax = COPY %3:gr32
  RET 0, $eax

# End machine code for function _Z4testv.
jyknight commented 3 months ago

This is not a bug. Thread-switching stackful coroutines are not supported. Do not switch threads behind the compiler's back.

Maybe it could be investigated as a new compiler feature. But only maybe -- IMO it is just fundamentally a bad idea and unnecessary.

Fulgen301 commented 3 months ago

MSVC doesn't have /GT because somebody wanted to support some arcane usecase - it's required to make the stackful coroutines that Windows provides since Windows XP - fibers - work with thread-local storage:

Because a fiber may get swapped out and restarted later on a different thread, the compiler mustn't cache the address of the TLS array, or optimize it as a common subexpression across a function call.

Do not switch threads behind the compiler's back.

This is already possible using regular stackless C++20 coroutines, for which that optimization is properly disabled (https://github.com/llvm/llvm-project/issues/63022), and a /GT equivalent and / or support for /GT in clang-cl would also be needed for compatibility with MSVC.

jyknight commented 3 months ago

C++20 coroutine suspend points are visible to the compiler, with explicitly annotated locally-visible suspend points. And after initial codegen passes, they have been transformed into linear code flow that cannot suspend (only return). So this is a local concern only of that one function body.

This is rather different from saying all functions in the entire program might change their thread at any call. At all stages of codegen through machine code.

Re: Windows fibers, they are not widely used, and there's multiple articles from MS dis-recommending them. Not every feature msvc has necessarily needs to be supported in clang.

ChuanqiXu9 commented 3 months ago

Yeah, for stackful coroutines, if we don't want to disable optimizations for the entire project due to a single use of a stackful coroutine, we can only do some IPA for that. But on the one hand, IPA is hard. On the hand, we have to be conservative, so that any uncertain results will be seen as "don't optimize the function, please". I feel the benefit and the cost may not be balanced.