Open omern1 opened 3 months ago
@llvm/issue-subscribers-coroutines
Author: Nabeel Omer (omern1)
Also adding the SelectionDAG label because the "de-duplication" of the TLS address load happens there.
=== _Z4testv
Initial selection DAG: %bb.0 '_Z4testv:entry'
SelectionDAG has 24 nodes:
t2: i64 = Constant<0>
t0: ch,glue = EntryToken
t4: i64,ch = load<(dereferenceable load (s64) from %ir.0)> t0, GlobalTLSAddress:i64<ptr @tls_foo> 0, undef:i64
t5: i64 = GlobalAddress<ptr @_Z8do_stuffPv> 0
t7: ch,glue = callseq_start t4:1, TargetConstant:i64<0>, TargetConstant:i64<0>
t9: ch,glue = CopyToReg t7, Register:i64 $rdi, t4
t12: ch,glue = X86ISD::CALL t9, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t9:1
t13: ch,glue = callseq_end t12, TargetConstant:i64<0>, TargetConstant:i64<0>, t12:1
t14: i64,ch = load<(dereferenceable load (s64) from %ir.2)> t13, GlobalTLSAddress:i64<ptr @tls_foo> 0, undef:i64
t15: ch,glue = callseq_start t14:1, TargetConstant:i64<0>, TargetConstant:i64<0>
t16: ch,glue = CopyToReg t15, Register:i64 $rdi, t14
t17: ch,glue = X86ISD::CALL t16, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t16:1
t18: ch,glue = callseq_end t17, TargetConstant:i64<0>, TargetConstant:i64<0>, t17:1
t22: ch,glue = CopyToReg t18, Register:i32 $eax, Constant:i32<0>
t23: ch = X86ISD::RET_GLUE t22, TargetConstant:i32<0>, Register:i32 $eax, t22:1
Optimized lowered selection DAG: %bb.0 '_Z4testv:entry'
SelectionDAG has 22 nodes:
t0: ch,glue = EntryToken
t4: i64,ch = load<(dereferenceable load (s64) from %ir.0)> t0, GlobalTLSAddress:i64<ptr @tls_foo> 0, undef:i64
t7: ch,glue = callseq_start t4:1, TargetConstant:i64<0>, TargetConstant:i64<0>
t9: ch,glue = CopyToReg t7, Register:i64 $rdi, t4
t12: ch,glue = X86ISD::CALL t9, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t9:1
t13: ch,glue = callseq_end t12, TargetConstant:i64<0>, TargetConstant:i64<0>, t12:1
t14: i64,ch = load<(dereferenceable load (s64) from %ir.2)> t13, GlobalTLSAddress:i64<ptr @tls_foo> 0, undef:i64
t15: ch,glue = callseq_start t14:1, TargetConstant:i64<0>, TargetConstant:i64<0>
t16: ch,glue = CopyToReg t15, Register:i64 $rdi, t14
t17: ch,glue = X86ISD::CALL t16, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t16:1
t18: ch,glue = callseq_end t17, TargetConstant:i64<0>, TargetConstant:i64<0>, t17:1
t22: ch,glue = CopyToReg t18, Register:i32 $eax, Constant:i32<0>
t23: ch = X86ISD::RET_GLUE t22, TargetConstant:i32<0>, Register:i32 $eax, t22:1
Type-legalized selection DAG: %bb.0 '_Z4testv:entry'
SelectionDAG has 22 nodes:
t0: ch,glue = EntryToken
t4: i64,ch = load<(dereferenceable load (s64) from %ir.0)> t0, GlobalTLSAddress:i64<ptr @tls_foo> 0, undef:i64
t7: ch,glue = callseq_start t4:1, TargetConstant:i64<0>, TargetConstant:i64<0>
t9: ch,glue = CopyToReg t7, Register:i64 $rdi, t4
t12: ch,glue = X86ISD::CALL t9, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t9:1
t13: ch,glue = callseq_end t12, TargetConstant:i64<0>, TargetConstant:i64<0>, t12:1
t14: i64,ch = load<(dereferenceable load (s64) from %ir.2)> t13, GlobalTLSAddress:i64<ptr @tls_foo> 0, undef:i64
t15: ch,glue = callseq_start t14:1, TargetConstant:i64<0>, TargetConstant:i64<0>
t16: ch,glue = CopyToReg t15, Register:i64 $rdi, t14
t17: ch,glue = X86ISD::CALL t16, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t16:1
t18: ch,glue = callseq_end t17, TargetConstant:i64<0>, TargetConstant:i64<0>, t17:1
t22: ch,glue = CopyToReg t18, Register:i32 $eax, Constant:i32<0>
t23: ch = X86ISD::RET_GLUE t22, TargetConstant:i32<0>, Register:i32 $eax, t22:1
Legalized selection DAG: %bb.0 '_Z4testv:entry'
SelectionDAG has 28 nodes:
t0: ch,glue = EntryToken
t4: i64,ch = load<(dereferenceable load (s64) from %ir.0)> t0, t30, undef:i64
t7: ch,glue = callseq_start t4:1, TargetConstant:i64<0>, TargetConstant:i64<0>
t9: ch,glue = CopyToReg t7, Register:i64 $rdi, t4
t12: ch,glue = X86ISD::CALL t9, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t9:1
t13: ch,glue = callseq_end t12, TargetConstant:i64<0>, TargetConstant:i64<0>, t12:1
t14: i64,ch = load<(dereferenceable load (s64) from %ir.2)> t13, t30, undef:i64
t15: ch,glue = callseq_start t14:1, TargetConstant:i64<0>, TargetConstant:i64<0>
t16: ch,glue = CopyToReg t15, Register:i64 $rdi, t14
t17: ch,glue = X86ISD::CALL t16, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t16:1
t18: ch,glue = callseq_end t17, TargetConstant:i64<0>, TargetConstant:i64<0>, t17:1
t22: ch,glue = CopyToReg t18, Register:i32 $eax, Constant:i32<0>
t25: ch,glue = X86ISD::TLSBASEADDR t0, TargetGlobalTLSAddress:i64<ptr @tls_foo> 0 [TF=9]
t29: i64 = X86ISD::Wrapper TargetGlobalTLSAddress:i64<ptr @tls_foo> 0 [TF=14]
t27: i64,ch,glue = CopyFromReg t25, Register:i64 $rax, t25:1
t30: i64 = add t29, t27
t23: ch = X86ISD::RET_GLUE t22, TargetConstant:i32<0>, Register:i32 $eax, t22:1
Optimized legalized selection DAG: %bb.0 '_Z4testv:entry'
SelectionDAG has 28 nodes:
t0: ch,glue = EntryToken
t4: i64,ch = load<(dereferenceable load (s64) from %ir.0)> t0, t30, undef:i64
t7: ch,glue = callseq_start t4:1, TargetConstant:i64<0>, TargetConstant:i64<0>
t9: ch,glue = CopyToReg t7, Register:i64 $rdi, t4
t12: ch,glue = X86ISD::CALL t9, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t9:1
t13: ch,glue = callseq_end t12, TargetConstant:i64<0>, TargetConstant:i64<0>, t12:1
t14: i64,ch = load<(dereferenceable load (s64) from %ir.2)> t13, t30, undef:i64
t15: ch,glue = callseq_start t14:1, TargetConstant:i64<0>, TargetConstant:i64<0>
t16: ch,glue = CopyToReg t15, Register:i64 $rdi, t14
t17: ch,glue = X86ISD::CALL t16, TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t16:1
t18: ch,glue = callseq_end t17, TargetConstant:i64<0>, TargetConstant:i64<0>, t17:1
t22: ch,glue = CopyToReg t18, Register:i32 $eax, Constant:i32<0>
t25: ch,glue = X86ISD::TLSBASEADDR t0, TargetGlobalTLSAddress:i64<ptr @tls_foo> 0 [TF=9]
t29: i64 = X86ISD::Wrapper TargetGlobalTLSAddress:i64<ptr @tls_foo> 0 [TF=14]
t27: i64,ch,glue = CopyFromReg t25, Register:i64 $rax, t25:1
t30: i64 = add t29, t27
t23: ch = X86ISD::RET_GLUE t22, TargetConstant:i32<0>, Register:i32 $eax, t22:1
Selected selection DAG: %bb.0 '_Z4testv:entry'
SelectionDAG has 28 nodes:
t0: ch,glue = EntryToken
t25: i64,ch,glue = TLS_base_addr64 Register:i64 $noreg, TargetConstant:i8<1>, Register:i64 $noreg, TargetGlobalTLSAddress:i32<ptr @tls_foo> 0 [TF=9], Register:i16 $noreg, t0
t27: i64,ch,glue = CopyFromReg t25:1, Register:i64 $rax, t25:2
t4: i64,ch = MOV64rm<Mem:(dereferenceable load (s64) from %ir.0)> t27, TargetConstant:i8<1>, Register:i64 $noreg, TargetGlobalTLSAddress:i32<ptr @tls_foo> 0 [TF=14], Register:i16 $noreg, t0
t7: i64,ch,glue = ADJCALLSTACKDOWN64 TargetConstant:i64<0>, TargetConstant:i64<0>, TargetConstant:i32<0>, t4:1
t9: ch,glue = CopyToReg t7:1, Register:i64 $rdi, t4
t12: ch,glue = CALL64pcrel32 TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t9, t9:1
t13: i64,ch,glue = ADJCALLSTACKUP64 TargetConstant:i64<0>, TargetConstant:i64<0>, t12, t12:1
t14: i64,ch = MOV64rm<Mem:(dereferenceable load (s64) from %ir.2)> t27, TargetConstant:i8<1>, Register:i64 $noreg, TargetGlobalTLSAddress:i32<ptr @tls_foo> 0 [TF=14], Register:i16 $noreg, t13:1
t15: i64,ch,glue = ADJCALLSTACKDOWN64 TargetConstant:i64<0>, TargetConstant:i64<0>, TargetConstant:i32<0>, t14:1
t16: ch,glue = CopyToReg t15:1, Register:i64 $rdi, t14
t17: ch,glue = CALL64pcrel32 TargetGlobalAddress:i64<ptr @_Z8do_stuffPv> 0 [TF=7], Register:i64 $rdi, RegisterMask:Untyped, t16, t16:1
t18: i64,ch,glue = ADJCALLSTACKUP64 TargetConstant:i64<0>, TargetConstant:i64<0>, t17, t17:1
t22: ch,glue = CopyToReg t18:1, Register:i32 $eax, MOV32r0:i32,i32
t23: ch = RET TargetConstant:i32<0>, Register:i32 $eax, t22, t22:1
*** MachineFunction at end of ISel ***
# Machine code for function _Z4testv: IsSSA, TracksLiveness
bb.0.entry:
TLS_base_addr64 $noreg, 1, $noreg, target-flags(x86-tlsld) @tls_foo, $noreg, implicit-def $rax, implicit-def dead $rcx, implicit-def dead $rdx, implicit-def dead $rsi, implicit-def dead $rdi, implicit-def dead $r8, implicit-def dead $r9, implicit-def dead $r10, implicit-def dead $r11, implicit-def dead $fp0, implicit-def dead $fp1, implicit-def dead $fp2, implicit-def dead $fp3, implicit-def dead $fp4, implicit-def dead $fp5, implicit-def dead $fp6, implicit-def dead $fp7, implicit-def dead $st0, implicit-def dead $st1, implicit-def dead $st2, implicit-def dead $st3, implicit-def dead $st4, implicit-def dead $st5, implicit-def dead $st6, implicit-def dead $st7, implicit-def dead $mm0, implicit-def dead $mm1, implicit-def dead $mm2, implicit-def dead $mm3, implicit-def dead $mm4, implicit-def dead $mm5, implicit-def dead $mm6, implicit-def dead $mm7, implicit-def dead $xmm0, implicit-def dead $xmm1, implicit-def dead $xmm2, implicit-def dead $xmm3, implicit-def dead $xmm4, implicit-def dead $xmm5, implicit-def dead $xmm6, implicit-def dead $xmm7, implicit-def dead $xmm8, implicit-def dead $xmm9, implicit-def dead $xmm10, implicit-def dead $xmm11, implicit-def dead $xmm12, implicit-def dead $xmm13, implicit-def dead $xmm14, implicit-def dead $xmm15, implicit-def dead $eflags, implicit-def dead $df, implicit $rsp, implicit $ssp
%0:gr64 = COPY $rax
%1:gr64 = MOV64rm %0:gr64, 1, $noreg, target-flags(x86-dtpoff) @tls_foo, $noreg :: (dereferenceable load (s64) from %ir.0)
ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
$rdi = COPY %1:gr64
CALL64pcrel32 target-flags(x86-plt) @_Z8do_stuffPv, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $rdi, implicit-def $rsp, implicit-def $ssp
ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
%2:gr64 = MOV64rm %0:gr64, 1, $noreg, target-flags(x86-dtpoff) @tls_foo, $noreg :: (dereferenceable load (s64) from %ir.2)
ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
$rdi = COPY %2:gr64
CALL64pcrel32 target-flags(x86-plt) @_Z8do_stuffPv, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $rdi, implicit-def $rsp, implicit-def $ssp
ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
%3:gr32 = MOV32r0 implicit-def dead $eflags
$eax = COPY %3:gr32
RET 0, $eax
# End machine code for function _Z4testv.
This is not a bug. Thread-switching stackful coroutines are not supported. Do not switch threads behind the compiler's back.
Maybe it could be investigated as a new compiler feature. But only maybe -- IMO it is just fundamentally a bad idea and unnecessary.
MSVC doesn't have /GT
because somebody wanted to support some arcane usecase - it's required to make the stackful coroutines that Windows provides since Windows XP - fibers - work with thread-local storage:
Because a fiber may get swapped out and restarted later on a different thread, the compiler mustn't cache the address of the TLS array, or optimize it as a common subexpression across a function call.
Do not switch threads behind the compiler's back.
This is already possible using regular stackless C++20 coroutines, for which that optimization is properly disabled (https://github.com/llvm/llvm-project/issues/63022), and a /GT
equivalent and / or support for /GT
in clang-cl would also be needed for compatibility with MSVC.
C++20 coroutine suspend points are visible to the compiler, with explicitly annotated locally-visible suspend points. And after initial codegen passes, they have been transformed into linear code flow that cannot suspend (only return). So this is a local concern only of that one function body.
This is rather different from saying all functions in the entire program might change their thread at any call. At all stages of codegen through machine code.
Re: Windows fibers, they are not widely used, and there's multiple articles from MS dis-recommending them. Not every feature msvc has necessarily needs to be supported in clang.
Yeah, for stackful coroutines, if we don't want to disable optimizations for the entire project due to a single use of a stackful coroutine, we can only do some IPA for that. But on the one hand, IPA is hard. On the hand, we have to be conservative, so that any uncertain results will be seen as "don't optimize the function, please". I feel the benefit and the cost may not be balanced.
Output:
In our (Sony's) private repro
test
is executing on a fiber created by an M:N fiber library (stack-full coroutines) anddo_stuff
causes the fiber to be suspended. Whentest
resumes execution it's likely that it won't be running on the same OS thread as it was before it was suspended. This means that the TLS address loaded won't be valid on the new thread and will cause a crash.MSVC has the
/GT
commandline option which makes it reload the TLS address before each load of a TLS variable and I think we probably need something similar in LLVM.