loongson-community / discussions

Cross-community issue tracker & discussions
ELF: Handle R_LARCH_PCALA64_* in a correct and reasonable way #17

Closed xry111 closed 5 months ago

xry111 commented 7 months ago

Background

For the extreme code model, we materialize the address of a symbol (either data or code) with:

pcalau12i $t0, %pc_hi20(sym)
addi.d $t1, $zero, %pc_lo12(sym)
lu32i.d $t1, %pc64_lo20(sym)
lu52i.d $t1, $t1, %pc64_hi12(sym)
add.d $t0, $t0, $t1

Consider this example:

.text
.globl load_addr
load_addr:
    la.pcrel $a0, $t0, sym
    jr $ra
.data
sym:
    .dword 0

With cc bug.s -Ttext=0x180000ff8 -Tdata=0x1000000000 -shared -nostdlib we get:

0000000180000ff8 <load_addr>:
   180000ff8:   1b000004    pcalau12i       $a0, -524288
   180000ffc:   02c0000c    li.d            $t0, 0
   180001000:   160001cc    lu32i.d         $t0, 14
   180001004:   0300018c    lu52i.d         $t0, $t0, 0
   180001008:   0010b084    add.d           $a0, $a0, $t0
   18000100c:   4c000020    ret         

But this is wrong: the correct immediate in lu32i.d should be 15.

The problem is this "14" is calculated with the PC of the lu32i.d instruction (0x180001000), while in fact the PC of the pcalau12i instruction (0x180000ff8) shall be used.
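
The arithmetic behind the 14 versus 15 discrepancy can be reproduced with a small model. This is a minimal sketch, not the linker's actual code: `page_delta`, `pc64_lo20`, and `run_sequence` are hypothetical helper names, and the sign-extension compensation is an assumed simplification of what a linker has to do for `addi.d` and `pcalau12i`.

```python
MASK64 = (1 << 64) - 1

def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def page(addr):
    """Clear the low 12 bits (4 KiB page base)."""
    return addr & ~0xFFF

def page_delta(dest, pc):
    """Page delta from pc to dest, pre-compensated for sign extension (assumed model)."""
    delta = (page(dest) - page(pc)) & MASK64
    if dest & 0x800:          # addi.d will sign-extend a negative lo12
        delta = (delta + 0x1000 - (1 << 32)) & MASK64
    if delta & 0x8000_0000:   # pcalau12i will sign-extend a negative hi20
        delta = (delta + (1 << 32)) & MASK64
    return delta

def pc64_lo20(dest, pc):
    """Immediate for lu32i.d when the page delta is taken relative to pc."""
    return (page_delta(dest, pc) >> 32) & 0xFFFFF

def run_sequence(pcalau12i_pc, dest, lo20):
    """Emulate the five-instruction sequence with a given lu32i.d immediate."""
    delta = page_delta(dest, pcalau12i_pc)
    hi20 = (delta >> 12) & 0xFFFFF
    hi12 = (delta >> 52) & 0xFFF
    a0 = (page(pcalau12i_pc) + (sext(hi20, 20) << 12)) & MASK64  # pcalau12i
    t0 = sext(dest & 0xFFF, 12) & MASK64                         # addi.d t0, zero, lo12
    t0 = ((sext(lo20, 20) << 32) & MASK64) | (t0 & 0xFFFF_FFFF)  # lu32i.d
    t0 = (hi12 << 52) | (t0 & ((1 << 52) - 1))                   # lu52i.d
    return (a0 + t0) & MASK64                                    # add.d

SYM = 0x1000000000
PCALAU12I_PC = 0x180000FF8
LU32I_PC = 0x180001000

print(pc64_lo20(SYM, LU32I_PC), pc64_lo20(SYM, PCALAU12I_PC))  # 14 15
print(hex(run_sequence(PCALAU12I_PC, SYM, 14)))  # 0xf00000000, off by 4 GiB
print(hex(run_sequence(PCALAU12I_PC, SYM, 15)))  # 0x1000000000, the address of sym
```

In this model the two PCs fall in different 4 KiB pages (0x180000000 and 0x180001000), which is exactly what flips bit 31 of the page delta and changes the upper immediate.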

Possible solution

Easy solution (limiting scheduling)

In GAS, emit 64-bit la.pcrel as-is:

pcalau12i $t0, %pc_hi20(sym)
addi.d $t1, $zero, %pc_lo12(sym)
lu32i.d $t1, %pc64_lo20(sym + 8)
lu52i.d $t1, $t1, %pc64_hi12(sym + 12)
add.d $t0, $t0, $t1

In GCC, if -mexplicit-relocs=always, emit it as:

addi.d $t1, $zero, %pc_lo12(sym)
# The following three instructions must stay adjacent; the scheduler must not insert anything between them
pcalau12i $t0, %pc_hi20(sym)
lu32i.d $t1, %pc64_lo20(sym + 4)
lu52i.d $t1, $t1, %pc64_hi12(sym + 8)
# End of the adjacent group
add.d $t0, $t0, $t1

Hard solution (allowing scheduling)

For GAS, use the easy solution.

For GCC, introduce a new reloc type "R_LARCH_EFFECTIVE_PC" and do something like:

1: pcalau12i $t0, %pc_hi20(sym)
addi.d $t1, $zero, %pc_lo12(sym)
.reloc 0, R_LARCH_EFFECTIVE_PC, 1b
lu32i.d $t1, %pc64_lo20(sym)
.reloc 0, R_LARCH_EFFECTIVE_PC, 1b
lu52i.d $t1, $t1, %pc64_hi12(sym)
add.d $t0, $t0, $t1

xry111 commented 7 months ago

Cc @xen0n @heiher @MaskRay @SixWeining @MQ-mengqing Ref https://github.com/llvm/llvm-project/pull/71907

xry111 commented 7 months ago

Note that the easy solution may blow up things like

la.pcrel $a0, $t0, array + 0xffffffff

because we cannot encode "0xffffffff + 8" in r_addend. So perhaps the "hard" solution is actually easier...

xen0n commented 7 months ago

In essence, the "hard" solution you've mentioned is for providing the necessary association between related relocs/insns, which does work, and is what RISC-V does (with their LO12 relocs referencing back to the HI20 reloc instead of the symbol) so most if not all of the machinery is already present.

I don't know whether some kind of "macro-op fusion" in the micro-architecture would become possible if we gave up instruction scheduling and guaranteed that the immediate-loading insns stay adjacent. Still, since the additional relationship information also helps resolve other ambiguities (as can be seen in the comments in the LLD LoongArch code), I'd be in favor of the "hard" proposal too. The reloc name could use some bikeshedding, but the info it provides is invaluable.

MQ-mengqing commented 7 months ago

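The easy solution is actually not that easy. Relocations are computed as S+A and the relocated symbol may already carry an addend, so the effective PC is not simply current PC - addend.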

la.pcrel $xx, $yy, sym + 8
->
000: pcalau12i(sym + 8)
004: addi.d(sym + 8)
008: lu32i.d(sym + 8 + 8)
00c: lu52i.d(sym + 8 + 12)

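For lu32i.d and lu52i.d, ld should still derive the PC in a fixed way.

64-bit la.pcrel has caused almost no problems so far, probably because it is rarely used, and even when it is used the bug only triggers in edge cases that are hard to hit. Fixing this with the easy solution may lower the cost of the code changes, but it has drawbacks in the long run.

The hard solution is similar to what RISC-V does. It is highly feasible and allows scheduling. But it does differ from the current ABI, and I expect the changes to be large.

Here are a few thoughts or questions of mine, some of them slightly off-topic:
1. PCALA_HI20 currently has no overflow check; the mold maintainer has raised this as well.
2. If the three instructions pcalau12i, lu32i.d, and lu52i.d are kept together, could they be covered by a single relocation, as with call36?
3. The current relocations are position-independent only at 4 KB granularity, whereas the older pcaddu12i+addi.d / pcaddu12i+ori+lu32i.d+lu52i.d+add.d sequences were position-independent at 4-byte granularity. Could we also add such runs of 2/4 consecutive instructions covered by a single relocation, like call36?
4. As I understand it, explicit relocs exist because (1) the instructions can take part in scheduling, and (2) some address loads (say, 32-bit ones) can share the first pcalau12i (HI), i.e. one HI can be used by several LOs. Without explicit relocs, relaxation is easier. (A guess:) if we follow RISC-V, the instructions could not only take part in scheduling, but relocation could also be done at the LO site; and if a HI ends up with zero references, it could even be deleted.
5. If we go with the hard solution, will call36 be changed in a similar way?
6. If we go with the hard solution, the basic relocations will change, which will probably require changes in some software. (Or could we add some automatic detection, like the current -mexplicit-relocs?)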

xen0n commented 7 months ago

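If we go with the hard solution, the basic relocations will change, which will probably require changes in some software. (Or could we add some automatic detection, like the current -mexplicit-relocs?)

As I understand it, the forward-compatibility concern here is mainly that the newly added marker reloc will be treated as an unknown record by components that do not recognize it. They may ignore it or report an error; most will probably report an error. However, older components that ignore the new marker, and components that support the new ABI but are processing old object code, can simply keep using the old logic.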

xen0n commented 7 months ago

Also cc @abner-chenc -- the ultimate solution to this issue will likely involve code changes at Go side, so your team's input/acknowledgement is also welcome.

xry111 commented 7 months ago

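If we go with the hard solution, the basic relocations will change, which will probably require changes in some software. (Or could we add some automatic detection, like the current -mexplicit-relocs?)

As I understand it, the forward-compatibility concern here is mainly that the newly added marker reloc will be treated as an unknown record by components that do not recognize it. They may ignore it or report an error; most will probably report an error. However, older components that ignore the new marker, and components that support the new ABI but are processing old object code, can simply keep using the old logic.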

A "fully backward-compatible" fix might be

# do not allow scheduling other instructions in-between them
.align 4
pcalau12i $t0, %pc_hi20(sym)
lu32i.d $t1, %pc64_lo20(sym)
lu52i.d $t1, $t1, %pc64_hi12(sym)

This guarantees that the pcalau12i, lu32i.d, and lu52i.d instructions are in the same 4K-page.
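
The same-page claim can be checked by brute force. A minimal sketch, assuming `.align 4` means 2**4 = 16-byte alignment (as it does in GAS for power-of-two-alignment targets) and that the three instructions are emitted back to back; `same_page` is a hypothetical helper name.

```python
PAGE = 0x1000       # 4 KiB page
ALIGN = 16          # assumed meaning of `.align 4`: 2**4 bytes
INSN_BYTES = 4
NUM_INSNS = 3       # pcalau12i, lu32i.d, lu52i.d

def same_page(start):
    """True if the first and last instruction of the group share a 4 KiB page."""
    last = start + (NUM_INSNS - 1) * INSN_BYTES
    return start // PAGE == last // PAGE

# Every 16-byte-aligned start keeps the 12-byte group inside one page.
aligned_ok = all(same_page(s) for s in range(0, 2 * PAGE, ALIGN))

# With only 4-byte alignment, starts near the end of a page straddle it.
straddling = [hex(s) for s in range(0, 2 * PAGE, INSN_BYTES) if not same_page(s)]

print(aligned_ok)   # True
print(straddling)   # ['0xff8', '0xffc', '0x1ff8', '0x1ffc']
```

Note that 0xff8 is the low part of the address of `load_addr` in the failing example, so the original reproducer is one of the straddling cases.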

xry111 commented 5 months ago

Should be fixed with Binutils 2.42, GCC 14, and LLVM 18 (all following psABI 2.30).
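
For illustration, here is a minimal sketch of how a linker could recover the right PC without a new reloc type, assuming the toolchain keeps pcalau12i, addi.d, lu32i.d, and lu52i.d adjacent and in that order. The thread itself does not spell out the final algorithm, so the offset table and the `page_delta` helper below are assumptions, not a copy of any linker's code.

```python
MASK64 = (1 << 64) - 1

# Assumed byte offset of each instruction from the leading pcalau12i
# when the four instructions are emitted back to back.
OFFSET_FROM_PCALAU12I = {"pcalau12i": 0, "addi.d": 4, "lu32i.d": 8, "lu52i.d": 12}

def page(addr):
    """Clear the low 12 bits (4 KiB page base)."""
    return addr & ~0xFFF

def page_delta(dest, insn_pc, insn):
    """Page delta computed against the inferred PC of pcalau12i (assumed model)."""
    pcalau12i_pc = insn_pc - OFFSET_FROM_PCALAU12I[insn]
    delta = (page(dest) - page(pcalau12i_pc)) & MASK64
    if dest & 0x800:          # compensate for addi.d sign-extending lo12
        delta = (delta + 0x1000 - (1 << 32)) & MASK64
    if delta & 0x8000_0000:   # compensate for pcalau12i sign-extending hi20
        delta = (delta + (1 << 32)) & MASK64
    return delta

SYM = 0x1000000000
# Addresses taken from the disassembly in the first comment.
lo20 = (page_delta(SYM, 0x180001000, "lu32i.d") >> 32) & 0xFFFFF
hi12 = (page_delta(SYM, 0x180001004, "lu52i.d") >> 52) & 0xFFF

print(lo20, hi12)  # 15 0
```

Under these assumptions the lu32i.d immediate comes out as 15 even though lu32i.d itself sits in the next page, which matches the expected value from the original report.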