Open xen0n opened 7 months ago
cc @MQ-mengqing @heiher @xry111 @MaskRay
Issues to be resolved (IMO):
jirl $a1, $a1, %desc_call(var)
so a function won't have to save $ra only because it uses a TLS descriptor. But OTOH using another register might confuse the HW return address predictor.
movcf2gr $t0,$fcc0
movcf2gr $t1,$fcc1
bstrins.w $t0,$t1,1,1
movcf2gr $t1,$fcc2
bstrins.w $t0,$t1,2,2
# ...
st.d $t0,$sp,OFFSET_FCC
@xen0n: How did you handle this for in-kernel FPU usage? The situation is very similar to a context switch (as Florian Weimer said).
The kernel just does the equivalent of a FP context switch when entering/exiting in-kernel FPU critical sections.
st.d $t0,$sp,OFFSET_FCC
This should be "st.b" to be optimal.
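A rough model of why a byte store is enough: LoongArch has eight condition flag registers ($fcc0 to $fcc7), each one bit wide, so the packed value never exceeds 8 bits. The sketch below imitates the movcf2gr/bstrins.w sequence in Python. The helper names `bstrins_w`, `pack_fcc` and `unpack_fcc` are made up for illustration, and the sign extension that bstrins.w performs on the 32-bit result is left out.

```python
def bstrins_w(rd, rj, msb, lsb):
    """Simplified model of bstrins.w: copy bits [msb-lsb:0] of rj
    into bits [msb:lsb] of rd (sign extension of the result omitted)."""
    width = msb - lsb + 1
    mask = ((1 << width) - 1) << lsb
    return ((rd & ~mask) | ((rj << lsb) & mask)) & 0xFFFFFFFF


def pack_fcc(fcc):
    """Pack eight 1-bit flags the way the assembly sequence does."""
    assert len(fcc) == 8 and all(bit in (0, 1) for bit in fcc)
    t0 = fcc[0]                       # movcf2gr $t0,$fcc0
    for i in range(1, 8):
        t1 = fcc[i]                   # movcf2gr $t1,$fcc<i>
        t0 = bstrins_w(t0, t1, i, i)  # bstrins.w $t0,$t1,<i>,<i>
    return t0


def unpack_fcc(packed):
    """Inverse of pack_fcc: recover the eight flags from one byte."""
    return [(packed >> i) & 1 for i in range(8)]


flags = [1, 0, 1, 0, 0, 0, 0, 1]
packed = pack_fcc(flags)
print(f"packed = {packed:#04x}, fits in one byte: {packed <= 0xFF}")
```

Since every possible flag combination fits in 0x00..0xFF, storing the upper bytes with st.d only writes zeros.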
It seems the second point in [1] contradicts the viewpoint in [2]. I noticed that the mold author Rui said "I'd stick with the usual two-slot design". Those who prefer the 2-slot design have given enough reasons. And I don't know whether the way musl would implement it is acceptable. I'll bring my question up in the coming internal meeting.
[1] https://sourceware.org/pipermail/binutils/2023-December/130916.html
[2] https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/373#issuecomment-1668982387
Hmm, isn't [1] using the two-slot layout?
Para 2 in [1] says "When using multiple ways to access the same TLS variable, a maximum of 5 GOT slots are used." But only 2 slots are used for DESC, the other slots are used by GD or IE.
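To make the arithmetic concrete, here is a small sketch of the slot count. It assumes the conventional ELF TLS layout, which the quoted paragraph does not spell out: GD takes two GOT slots (module ID and offset) and IE takes one (TP-relative offset), while DESC takes the two slots mentioned above. The table and the function name `got_slots_used` are illustrative only.

```python
# GOT slots per TLS access model (GD and IE counts are an assumption
# based on the conventional ELF TLS layout; DESC is from the thread).
GOT_SLOTS = {"DESC": 2, "GD": 2, "IE": 1}


def got_slots_used(models):
    """Total GOT slots for one TLS variable accessed via the given models.
    Each model is counted once, however many accesses use it."""
    return sum(GOT_SLOTS[m] for m in set(models))


print(got_slots_used(["DESC"]))              # DESC alone
print(got_slots_used(["DESC", "GD", "IE"]))  # all three models mixed
```

Under these assumptions the maximum of 5 is reached only when the same variable is accessed through all three models.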
My misunderstanding was that "4-slot" means the second DESC slot points to the two GD slots (for dynamic TLS), and "2-slot" means only one of GD and DESC can exist, so whichever is present uses two slots. I'm confused about TLS and need to research it.
- The slow path of _dl_tlsdesc_dynamic calls __tls_get_addr, which in turn calls malloc (unless statically linked), and malloc may be interposed. An interposed malloc may clobber the fcc registers, so we either need to save/restore all fcc registers in the slow path, or tell the compiler that a TLS descriptor call may clobber the fcc registers. Which is better?
FWIW: AArch64 uses a clobber in the compiler.